Systems and methods for calculating recommendation scores based on combined signals from multiple recommendation systems

ABSTRACT

System and method for providing, in response to a search query, product recommendations based at least in part on a blend of recommendation signals from multiple product recommendation systems.

CROSS-REFERENCE TO RELATED APPLICATION

The present application claims the benefit of and priority to U.S. Provisional Patent Application No. 62/794,060, filed on Jan. 18, 2019, the entire contents of which are incorporated by reference herein.

FIELD OF INVENTION

The present invention generally relates to systems and methods for responding to search queries. More particularly, the present invention relates to systems and methods for responding to search queries relating to products by calculating product recommendation scores based on combined signals from multiple recommendation systems.

SUMMARY OF THE INVENTION

Systems and methods for calculating recommendation scores based on combined signals from multiple recommendation systems are provided.

More particularly, the present invention relates to a computer implemented method comprising the steps of (a) receiving, at a computer system comprising one or more computers, a search query containing a search phrase from a user device; (b) executing, by the computer system, a query on a search index database to identify matches in response to the search query; (c) computing, by the computer system, facet counts associated with the matches; (d) ranking, by the computer system, the matches; (e) selecting, by the computer system, top K of the matches, wherein K is a predetermined number; (f) receiving, at the computer system from a plurality of recommendation systems, a plurality of recommendation signals generated by the plurality of recommendation systems based at least in part on the search phrase, the top K of the matches and the facet counts; (g) calculating, by the computer system, recommendation scores based at least in part on a combination of the plurality of recommendation signals from the plurality of recommendation systems in accordance with a scoring function S[p, 1]:

$S[p,1] = B_{\epsilon}[p,1] \times \left((P|M)_{\epsilon}[p,n] \cdot U[n,r_a] \cdot W_a[r_a,1]\right)$,

wherein $B[p,1] = \hat{Q}[p,r_p] \cdot W_p[r_p,1]$ is a multiplicative boost vector, $U[n,r_a] = (R|M)_{\epsilon}[n,r_a] \times L[n,1]$ is an adjusted attribute value recommendation matrix, L is whitelisting and blacklisting of attributes and attribute values, P is a product matrix P[p, n] of p products, each product having a total of n possible one-hot encoded attribute values, M is a matrix M[a, n] of one-hot encoded attribute-value to attribute assignments, Q is a first recommendation signal Q[q, r_(p)] from r_(p) product recommenders, R is a second recommendation signal R[n, r_(a)] from r_(a) attribute- and attribute-value recommenders, W_(p) is a vector W_(p)[r_(p), 1] of product recommender weights, W_(a) is a vector W_(a)[r_(a), 1] of attribute- and attribute-value recommender weights, and the recommendation signals comprise the first recommendation signal Q[q, r_(p)] and the second recommendation signal R[n, r_(a)]; (h) determining, by the computer system, an order of the top K of the matches based at least in part on the recommendation scores; (i) determining, by the computer system, an order of facets and facet values based at least in part on the recommendation scores; (j) generating, by the computer system, a search result comprising the ordered top K of the matches and the ordered facets and facet values; and (k) providing, by the computer system, the search result to the user device.
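By way of non-limiting illustration, the scoring function of step (g) may be sketched as follows. The sketch assumes that the normalized product signal, the smoothed matrix (P|M)_ε and the adjusted attribute value matrix U have already been computed (their construction is not shown), and that the "×" operator denotes an element-wise product. All function and variable names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def recommendation_scores(q_hat, w_p, pm_eps, u, w_a):
    """Sketch of S[p,1] = B[p,1] x ((P|M)_eps[p,n] . U[n,r_a] . W_a[r_a,1]).

    q_hat  : p x r_p product signals (assumed already normalized)
    w_p    : r_p product recommender weights
    pm_eps : p x n smoothed product / attribute-value matrix (assumed given)
    u      : n x r_a adjusted attribute value recommendation matrix
    w_a    : r_a attribute- and attribute-value recommender weights
    """
    boost = matvec(q_hat, w_p)                 # B[p,1], multiplicative boost
    value_scores = matvec(u, w_a)              # one score per attribute value
    attr_part = matvec(pm_eps, value_scores)   # (P|M)_eps . (U . W_a)
    # Assumption: "x" in the formula is the element-wise product.
    return [b * a for b, a in zip(boost, attr_part)]
```

In this sketch a product receives a high score only when both its product-level boost and its attribute-level score are high, which reflects the multiplicative form of the formula.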

In at least one embodiment, the computer implemented method further comprises, after the step (g) and before the step (h), the step of selecting at most k from the top K of the matches that contribute at least s % of a total of the recommendation scores, wherein the s % is a predetermined percentage of the total of the recommendation scores and k≤K.

In at least one embodiment, the computer implemented method further comprises, after the step (g) and before the step (i), the step of selecting facet values that contribute at least a predetermined percentage of the total of the recommendation scores per facet.

In at least one embodiment, the step (g) of calculating recommendation scores comprises approximating, by the computer system using a machine learning model, a quadratic function S[p, 1]=(ϕ[p, r]·W_(p)[r, 1])×(ϕ[p, r]·W_(a)[r,1]) to learn the product recommender weights W_(p) and the attribute- and attribute-value recommender weights W_(a) based at least in part on key-performance-indicator driving feedback signals, wherein:

$\phi_p[p,r_p] = \hat{Q}_{\epsilon}[p,r_p]$;

$\phi_{a1}[p,r_a] = (P|M)_{\epsilon}[p,n] \cdot U[n,r_a]$;

$\phi[p,r] = \phi_p[p,r_p] \cup \phi_{a1}[p,r_a]$, where $r = r_p + r_a$;

and a log associated with ϕ is used as an input into training of the machine learning model.

In at least one embodiment, the step (g) of calculating recommendation scores comprises directly approximating, by the computer system using a first machine learning model, product feature weights $W[\tilde{r},1]$ wherein:

$S(\phi)[p,1] = \Phi[p,\tilde{r}] \cdot W[\tilde{r},1]$,

$W[\tilde{r},1] = \left(\Phi^{T}[\tilde{r},p_t] \cdot \Phi[p_t,\tilde{r}] + \alpha I[\tilde{r},\tilde{r}]\right)^{-1} \cdot \Phi^{T}[\tilde{r},p_t] \cdot S[p_t,1]$,

$\tilde{r}$ is a number of transformed features from r input features,

and transformed features $\Phi[p,\tilde{r}]$ of a log associated with ϕ[p, r] are used as input into training of the first machine learning model.
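By way of non-limiting illustration, the closed-form expression for the product feature weights has the form of a regularized least-squares (ridge) solution and may be sketched as follows. The sketch solves the equivalent linear system rather than forming the inverse explicitly, and assumes α > 0 or a full-rank feature matrix so that the system is solvable. The function name and the use of plain Python lists are assumptions made for illustration.

```python
def ridge_weights(phi, s, alpha):
    """Solve (Phi^T Phi + alpha I) W = Phi^T S for W by Gaussian elimination.

    phi   : p_t x r training feature rows (list of lists)
    s     : p_t target scores
    alpha : regularization strength
    """
    n_feat = len(phi[0])
    # Normal-equation matrix A = Phi^T Phi + alpha I and right-hand side b = Phi^T S.
    a = [[sum(row[i] * row[j] for row in phi) + (alpha if i == j else 0.0)
          for j in range(n_feat)] for i in range(n_feat)]
    b = [sum(row[i] * y for row, y in zip(phi, s)) for i in range(n_feat)]
    # Forward elimination with partial pivoting.
    for col in range(n_feat):
        pivot = max(range(col, n_feat), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n_feat):
            factor = a[r][col] / a[col][col]
            for c in range(col, n_feat):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    w = [0.0] * n_feat
    for i in reversed(range(n_feat)):
        tail = sum(a[i][j] * w[j] for j in range(i + 1, n_feat))
        w[i] = (b[i] - tail) / a[i][i]
    return w
```

With α = 0 and exactly reproducible targets the sketch recovers the generating weights; a larger α shrinks the weights toward zero.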

In at least one embodiment, the transformed features $\Phi[p,\tilde{r}]$ are obtained by:

$\Phi[p,\tilde{r}] = \bigcup_{k=1}^{p} \varphi(k) \otimes \varphi(k)$.
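By way of non-limiting illustration, the feature transformation may be sketched as follows, assuming that ⊗ denotes the outer product of a feature row with itself, flattened into a single row, and that the union stacks the resulting rows for the p products. Under these assumptions the number of transformed features equals the square of the number of input features. The function name is hypothetical.

```python
def quadratic_features(phi):
    """Map each feature row phi(k) to the flattened outer product phi(k) (x) phi(k)."""
    return [[x * y for x in row for y in row] for row in phi]
```

For example, a product with input features (1, 2) would be mapped to the transformed features (1, 2, 2, 4), so that a linear model over the transformed features can represent products of pairs of input signals.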

In at least one embodiment, the step (g) of calculating recommendation scores comprises approximating, by the computer system using a first machine learning model, a linear function $\mathcal{F}_{a2}$ to learn attribute- and attribute-value weights W_(a), wherein:

$\mathcal{F}_{a2}(P,M,R,W_a)[n,1] = U[n,r_a] \cdot W_a[r_a,1]$;

$\phi_{a2}[n,r_a] = U[n,r_a]$; and

a log associated with ϕ_(a2) is used as an input into training of the first machine learning model.

In at least one embodiment, the first machine learning model comprises a pointwise training discipline using a linear model.

In at least one embodiment, the step (g) of calculating recommendation scores further comprises approximating, by the computer system using a second machine learning model, a feature transformation function

$\Phi[p,r_p] = \phi_p[p,r_p] \times \left((P|M)_{\epsilon}[p,n] \cdot \mathcal{F}_{a2}(\phi_{a2}[p,r_a])[n,1]\right)$

to learn the product recommender weights W_(p), wherein:

$S[p,1] = \Phi[p,r_p] \cdot W_p[r_p,1]$; and

$\phi_p[p,r_p] = \hat{Q}_{\epsilon}[p,r_p]$.

In at least one embodiment, the second machine learning model comprises a pointwise training discipline using a linear model.

In at least one embodiment, the step (g) of calculating recommendation scores comprises computing, by the computer system, a vector of attribute value scores V[n, 1] in accordance with:

$V[n,1] = U[n,r_a] \cdot W_a[r_a,1]$.

In at least one embodiment, the step (g) of calculating recommendation scores further comprises computing, by the computer system, attribute scores A[a, 1] by summing each block of V in accordance with:

$A[a,1] = M[a,n] \cdot V[n,1]$.
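By way of non-limiting illustration, the attribute value scores V and the attribute scores A may be sketched as follows. The sketch assumes that V is the product of the adjusted matrix U with the attribute- and attribute-value recommender weights W_(a), and that each row of M marks with a 1 the attribute values belonging to one attribute, so that multiplying by M sums the value scores within each attribute block. Function names are hypothetical.

```python
def attribute_value_scores(u, w_a):
    """V[n,1] = U[n,r_a] . W_a[r_a,1]: weighted blend of the attribute value signals."""
    return [sum(x * w for x, w in zip(row, w_a)) for row in u]


def attribute_scores(m, v):
    """A[a,1] = M[a,n] . V[n,1]: sum of value scores within each attribute block."""
    return [sum(flag * score for flag, score in zip(row, v)) for row in m]
```

For example, with two attributes where the first owns the first two attribute values and the second owns the third, the score of the first attribute is the sum of its two value scores.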

In at least one embodiment, the step (i) of determining an order of facets and facet values comprises generating, by the computer system, a vector of cumulative attribute value ranks I[n, 1] in accordance with:

${I\left\lbrack {n,1} \right\rbrack} = {{hi\_ sort}{\left( {{{V\left\lbrack {n,1} \right\rbrack} \times {\sum\limits_{p}{\left( {{\left( {PM} \right)_{\epsilon}\left\lbrack {p,n} \right\rbrack} \times {B_{\epsilon}\left\lbrack {p,1} \right\rbrack}} \right)^{T}\left\lbrack {n,p} \right\rbrack}}},{M\left\lbrack {a,n} \right\rbrack},{DESC}} \right).}}$

In at least one embodiment, the step (i) of determining an order of facets and facet values comprises generating, by the computer system, a vector of discriminative attribute value ranks I[n, 1] in accordance with:

$I[n,1] = \mathrm{hi\_sort}\left(V[n,1] \times (1 + \varepsilon - \Delta G^{T})[n,1],\; M[a,n],\; \mathrm{DESC}\right)$,

wherein ε[n, 1] is a vector having a very small value ε for each element to facilitate a tie-break,

$(G_1[1,n] - G_2[1,n]) \cdot V[n,1] = \Delta G[1,n] \cdot V[n,1] = \sum_{i=1}^{n} (G_1(i) - G_2(i)) \cdot V(i)$,

$G_1[1,n] = \left(\sum_{p_1} \mathcal{F}_{p_1}[p_1,1]\right)[1,1] \times \left(\sum_{p_1} (P|M)_{\epsilon_1}[p_1,n]\right)[1,n]$,

$G_2[1,n] = \left(\sum_{p_2} \mathcal{F}_{p_2}[p_2,1]\right)[1,1] \times \left(\sum_{p_2} (P|M)_{\epsilon_2}[p_2,n]\right)[1,n]$,

$\Delta G[1,n] = \left(\sum_{p_1} B_{\epsilon}\right)[1,1] \times \left(\sum_{p_1} (P|M)_{\epsilon}\right)[1,n] - \left(\sum_{p_2} B_{\epsilon}\right)[1,1] \times \left(\sum_{p_2} (P|M)_{\epsilon}\right)[1,n]$,

and wherein G₁ and G₂ are group vectors for a first group of products p₁ and a second group of products p₂, respectively, the first group of products p₁ and the second group of products p₂ being selected from the p products based on their recommendation scores in S[p, 1].

In addition, the present invention also relates to a computer system comprising one or more memories comprising a search index database; one or more processors operatively connected to the one or more memories; and one or more computer readable media operatively connected to the one or more processors and having stored thereon computer instructions for carrying out the steps of (a) receiving, at the computer system, a search query containing a search phrase from a user device; (b) executing, by the computer system, a query on a search index database to identify matches in response to the search query; (c) computing, by the computer system, facet counts associated with the matches; (d) ranking, by the computer system, the matches; (e) selecting, by the computer system, top K of the matches, wherein K is a predetermined number; (f) receiving, at the computer system from a plurality of recommendation systems, a plurality of recommendation signals generated by the plurality of recommendation systems based at least in part on the search phrase, a customer record, the top K of the matches and the facet counts; (g) calculating, by the computer system, recommendation scores based at least in part on a combination of the plurality of recommendation signals from the plurality of recommendation systems in accordance with a scoring function S[p, 1]:

$S[p,1] = B_{\epsilon}[p,1] \times \left((P|M)_{\epsilon}[p,n] \cdot U[n,r_a] \cdot W_a[r_a,1]\right)$,

wherein $B[p,1] = \hat{Q}[p,r_p] \cdot W_p[r_p,1]$ is a multiplicative boost vector, $U[n,r_a] = (R|M)_{\epsilon}[n,r_a] \times L[n,1]$ is an adjusted attribute value recommendation matrix, L is whitelisting and blacklisting of attributes and attribute values, P is a product matrix P[p, n] of p products, each product having a total of n possible one-hot encoded attribute values, M is a matrix M[a, n] of one-hot encoded attribute-value to attribute assignments, Q is a first recommendation signal Q[q, r_(p)] from r_(p) product recommenders, R is a second recommendation signal R[n, r_(a)] from r_(a) attribute- and attribute-value recommenders, W_(p) is a vector W_(p)[r_(p), 1] of product recommender weights, W_(a) is a vector W_(a)[r_(a), 1] of attribute- and attribute-value recommender weights, and the recommendation signals comprise the first recommendation signal Q[q, r_(p)] and the second recommendation signal R[n, r_(a)]; (h) determining, by the computer system, an order of the top K of the matches based at least in part on the recommendation scores; (i) determining, by the computer system, an order of facets and facet values based at least in part on the recommendation scores; (j) generating, by the computer system, a search result comprising the ordered top K of the matches and the ordered facets and facet values; and (k) providing, by the computer system, the search result to the user device.

In at least one embodiment, the computer instructions further carry out, after the step (g) and before the step (h), the step of selecting at most k from the top K of the matches that contribute at least s % of a total of the recommendation scores, wherein the s % is a predetermined percentage of the total of the recommendation scores and k≤K.

In at least one embodiment, the computer instructions further carry out, after the step (g) and before the step (i), the step of selecting facet values that contribute at least a predetermined percentage of the total of the recommendation scores per facet.

In at least one embodiment, the step (g) of calculating recommendation scores comprises approximating, by the computer system using a machine learning model, a quadratic function S[p, 1]=(ϕ[p, r]·W_(p)[r, 1])×(ϕ[p, r]·W_(a)[r, 1]) to learn the product recommender weights W_(p) and the attribute- and attribute-value recommender weights W_(a) based at least in part on key-performance-indicator driving feedback signals, wherein:

$\phi_p[p,r_p] = \hat{Q}_{\epsilon}[p,r_p]$;

$\phi_{a1}[p,r_a] = (P|M)_{\epsilon}[p,n] \cdot U[n,r_a]$;

$\phi[p,r] = \phi_p[p,r_p] \cup \phi_{a1}[p,r_a]$, where $r = r_p + r_a$;

and a log associated with ϕ is used as an input into training of the machine learning model.

In at least one embodiment, the step (g) of calculating recommendation scores comprises directly approximating, by the computer system using a first machine learning model, product feature weights $W[\tilde{r},1]$ wherein:

$S(\phi)[p,1] = \Phi[p,\tilde{r}] \cdot W[\tilde{r},1]$,

$W[\tilde{r},1] = \left(\Phi^{T}[\tilde{r},p_t] \cdot \Phi[p_t,\tilde{r}] + \alpha I[\tilde{r},\tilde{r}]\right)^{-1} \cdot \Phi^{T}[\tilde{r},p_t] \cdot S[p_t,1]$,

and transformed features $\Phi[p,\tilde{r}]$ of a log associated with ϕ[p, r] are used as input into training of the first machine learning model.

In at least one embodiment, the transformed features $\Phi[p,\tilde{r}]$ are obtained by:

$\Phi[p,\tilde{r}] = \bigcup_{k=1}^{p} \varphi(k) \otimes \varphi(k)$.

In at least one embodiment, the step (g) of calculating recommendation scores comprises approximating, by the computer system using a first machine learning model, a linear function $\mathcal{F}_{a2}$ to learn attribute- and attribute-value weights W_(a), wherein:

$\mathcal{F}_{a2}(P,M,R,W_a)[n,1] = U[n,r_a] \cdot W_a[r_a,1]$;

$\phi_{a2}[n,r_a] = U[n,r_a]$; and

a log associated with ϕ_(a2) is used as an input into training of the first machine learning model.

In at least one embodiment, the first machine learning model comprises a pointwise training discipline using a linear model.

In at least one embodiment, the step (g) of calculating recommendation scores further comprises approximating, by the computer system using a second machine learning model, a feature transformation function

$\Phi[p,r_p] = \phi_p[p,r_p] \times \left((P|M)_{\epsilon}[p,n] \cdot \mathcal{F}_{a2}(\phi_{a2}[p,r_a])[n,1]\right)$

to learn the product recommender weights W_(p), wherein:

$S[p,1] = \Phi[p,r_p] \cdot W_p[r_p,1]$; and

$\phi_p[p,r_p] = \hat{Q}_{\epsilon}[p,r_p]$.

In at least one embodiment, the second machine learning model comprises a pointwise training discipline using a linear model.

In at least one embodiment, the step (g) of calculating recommendation scores comprises computing, by the computer system, a vector of attribute value scores V[n, 1] in accordance with:

$V[n,1] = U[n,r_a] \cdot W_a[r_a,1]$.

In at least one embodiment, the step (g) of calculating recommendation scores further comprises computing, by the computer system, attribute scores A[a, 1] by summing each block of V in accordance with:

$A[a,1] = M[a,n] \cdot V[n,1]$.

In at least one embodiment, the step (i) of determining an order of facets and facet values comprises generating, by the computer system, a vector of cumulative attribute value ranks I[n, 1] in accordance with:

${I\left\lbrack {n,1} \right\rbrack} = {{hi\_ sort}{\left( {{{V\left\lbrack {n,1} \right\rbrack} \times {\sum\limits_{p_{1}}{\left( {{\left( {PM} \right)_{\epsilon}\left\lbrack {p,n} \right\rbrack} \times {B_{\epsilon}\left\lbrack {p,1} \right\rbrack}} \right)^{T}\left\lbrack {n,p} \right\rbrack}}},{M\left\lbrack {a,n} \right\rbrack},{DESC}} \right).}}$

In at least one embodiment, the step (i) of determining an order of facets and facet values comprises generating, by the computer system, a vector of discriminative attribute value ranks I[n, 1] in accordance with:

$I[n,1] = \mathrm{hi\_sort}\left(V[n,1] \times (1 + \varepsilon - \Delta G^{T})[n,1],\; M[a,n],\; \mathrm{DESC}\right)$,

wherein ε[n, 1] is a vector having a very small value for each element to facilitate a tie-break,

$(G_1[1,n] - G_2[1,n]) \cdot V[n,1] = \Delta G[1,n] \cdot V[n,1] = \sum_{i=1}^{n} (G_1(i) - G_2(i)) \cdot V(i)$,

$G_1[1,n] = \left(\sum_{p_1} \mathcal{F}_{p_1}[p_1,1]\right)[1,1] \times \left(\sum_{p_1} (P|M)_{\epsilon_1}[p_1,n]\right)[1,n]$,

$G_2[1,n] = \left(\sum_{p_2} \mathcal{F}_{p_2}[p_2,1]\right)[1,1] \times \left(\sum_{p_2} (P|M)_{\epsilon_2}[p_2,n]\right)[1,n]$,

$\Delta G[1,n] = \left(\sum_{p_1} B_{\epsilon}\right)[1,1] \times \left(\sum_{p_1} (P|M)_{\epsilon}\right)[1,n] - \left(\sum_{p_2} B_{\epsilon}\right)[1,1] \times \left(\sum_{p_2} (P|M)_{\epsilon}\right)[1,n]$,

and wherein G₁ and G₂ are group vectors for a first group of products p₁ and a second group of products p₂, respectively, the first group of products p₁ and the second group of products p₂ being selected from the p products based on their recommendation scores in S[p, 1].

BRIEF DESCRIPTION OF THE DRAWINGS

Exemplary embodiments of the present invention will be described with references to the accompanying figures, wherein:

FIG. 1 is a schematic diagram illustrating a system in accordance with an exemplary embodiment of the present invention.

FIG. 2 is a schematic diagram showing exemplary inputs to and exemplary outputs from a module executing a scoring algorithm in accordance with an exemplary embodiment of the present invention.

FIG. 3 is a flow chart of a process for performing searches and generating search results in accordance with an exemplary embodiment of the present invention.

FIG. 4 is a flow chart of a process for calculating recommendation scores based on combined signals from multiple recommendation systems in accordance with an exemplary embodiment of the present invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

In the context of product searches in one or more catalogs of products in, for example, e-commerce businesses, products have pre-defined “attributes,” such as brand, color, material, size, to name a few. Each specific product may be associated with some subset of all possible attributes. Each attribute in each product has one or more specific values (e.g., brand=Nike or occasion=Evening, Wedding). When a keyword search result or a category browse result contains a number of products, that result may be further limited by selecting only those products from the result that contain a certain “attribute value” (e.g., color=Red). When the result is limited in this way, it is said that a “filter” is applied on the color attribute with the “filter value” Red. Given a keyword search result or a category browse result further filtered by zero or more filter values, one may calculate the number of products in the result corresponding to the various product attribute values contained within the products in the result. For example, one can calculate how many products with size=Extra Large (or XL) or material=Leather are present in the result. Such product counts are referred to as “facet counts.” The attributes, such as size or material, for which the counts may be calculated, are referred to as “facets.” The specific values of those attributes for which the counts may be calculated are referred to as “facet values.” It is not required to calculate facets for all possible attributes present in the result. Typically, only a certain subset of attributes may be selected, usually by a Merchandising Rule.
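By way of non-limiting illustration, the computation of facet counts for a filtered result may be sketched as follows. The representation of a product as a mapping from attribute names to sets of values, and the function and parameter names, are assumptions made for illustration only.

```python
from collections import Counter


def facet_counts(products, filters=None):
    """Count products per (attribute, value) pair after applying optional filters.

    products : list of dicts mapping an attribute name to a set of values
    filters  : dict mapping an attribute name to a required filter value
    """
    filters = filters or {}
    counts = Counter()
    for product in products:
        # Keep only products that contain every requested filter value.
        if all(value in product.get(attr, ()) for attr, value in filters.items()):
            for attr, values in product.items():
                for value in values:
                    counts[(attr, value)] += 1
    return counts
```

For example, applying a filter on the size attribute with the filter value XL restricts the counts for the color facet to the products that are available in that size.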

Many e-commerce businesses use multiple disparate recommendation systems, each of which provides signals relating to recommendation of products and categories of products (e.g., product attributes) to customers and groups of customers. They are faced with various technical problems, including:

(1) How to combine the signals from the multiple recommendation systems to make the combined recommendations more relevant and useful than individual recommendations?

(2) Given product recommendations, how to compute product attribute (i.e. facet, facet value) recommendations? Given product attribute recommendations, how to compute product recommendations? Given both, how to implement mutual influence of the product and product attribute recommendations?

(3) Given a weighted blend of product and product attribute recommendation signals, how to learn the weights based on the customer activity?

To address the foregoing and/or other technical problems, at least one embodiment of the present invention may have one or more of the following technical features:

(i) A unified formula for computing product recommendation scores based on a combination of product and attribute recommendation signals.

(ii) A unified formula for computing attribute recommendation scores based on a combination of product and attribute recommendation signals.

(iii) Machine-learning unified formula weights based on product clicks.

(iv) Machine-learning unified formula weights based on product clicks and facet clicks.

Embodiments of the present invention may also provide various practical applications in the areas of keyword search, guided navigation, and product recommendations, including, but not limited to:

(a) Providing personalized product recommendations based on a blend of signals from multiple recommendation systems;

(b) Providing personalized product boosting in keyword search and in guided navigation based on a blend of signals from multiple recommendation systems;

(c) Providing personalized facet and facet value ordering in keyword search and guided navigation results pages based on a blend of signals from multiple recommendation systems;

(d) Improving non-personalized product sequencing in keyword search and guided navigation by incorporating additional signals based on the aggregate customer behavior; and

(e) Automatically learning the weights of blended recommendation signals based on the facet and product click feedback.

The methods of the present invention may be operated as part of a system for electronic commerce, such as one shown in FIG. 1. FIG. 1 is a schematic diagram illustrating a system 100 for electronic commerce in accordance with an exemplary embodiment of the present invention. The system 100 may comprise one or more user devices 10 that interact through a network 15, such as the Internet, with one or more merchant servers 40 that offer goods for sale via a web page or electronic application. Search queries generated by online shoppers from one or more user devices 10, such as tablets, smartphones, or computers, to name a few, connected through network 15, such as the Internet or an intranet, may be transmitted to one or more merchant servers 40, which, in turn, may transmit the search query through the network to one or more search engine servers 20. In embodiments, one or more administrator devices 30, connected through the network 15 to one or more merchant servers 40 or one or more search engine servers 20, may be used for configuring such servers. In other embodiments, one or more of the functions of each of these devices and servers may be combined into a single server, and/or divided up among one or more other servers performing duplicate and/or portions of the functions.

Referring to FIG. 1, a search engine server 20 may comprise a query interface module 210 configured to receive search queries and to transmit search results, a search controller module 220 configured to query a search index database 250 to find matches, a faceting module 230 configured to compute facet counts, and a scoring module 240 configured to rank the matches. In embodiments, the scoring module 240 may be further configured to select top K matches from all of the matches, wherein K may be a predetermined number (e.g., 5, 10, 50, 100, to name a few).

In embodiments, the search controller module 220 and the search index database 250 may be modules from a basic document search engine that are capable of a full conventional keyword search using a searchable keyword index. In embodiments, the search index database 250 may comprise one or more documents, each having one or more fields that may be separately indexed for searching. In connection with an electronic commerce application, a “document” may correspond to a particular product sold by a merchant and each searchable field of a “document” may correspond to a certain attribute of the product, such as, but not limited to, brand, size, color, type, etc.

In embodiments, the search index database 250 comprises multiple searchable indices that facilitate the searching of products. Each index may be associated with a particular attribute of a product and is generated from one or more terms corresponding to the particular attribute of all the products.

Referring back to FIG. 1, the search engine server 20 may be connected, either directly as shown in the figure or through a network 15 such as the Internet, to a matrix attribute recommendation learning ensemble engine (“MARLENE”) server 50. As further described below, the MARLENE server 50 may be configured to compute, using a MARLENE scoring module 510, recommendation scores (“MARLENE scores”) based on combined signals from multiple recommendation systems in accordance with an exemplary embodiment of the present invention. In alternative embodiments, the MARLENE server 50 and MARLENE scoring module 510 may be part of the search engine server 20.

FIG. 2 schematically illustrates an exemplary operation of the MARLENE scoring module 510. As shown in the figure, the MARLENE scoring module 510 may be configured to receive a set of inputs 520 a, which may include some or all of the following: (a) search phrase 523, (b) customer record 524, (c) top K matches with scores 525 from a search, (d) facets with counts 526 for all matches. In embodiments, the inputs 520 a to the MARLENE scoring module 510 may further include product category ID and/or a list of product IDs.

The MARLENE scoring module 510 may also receive another set of inputs 520 b generated by a plurality of recommendation systems based on some or all of the inputs 520 a. This input set 520 b may include, as shown in FIG. 2, (a) one or more product recommending signals 521-1, . . . , 521-M, (b) one or more attribute recommending signals 522-1, . . . , 522-N. In embodiments, the set of inputs 520 b to the MARLENE scoring module 510 may further include: attribute-value recommending signals, whitelisting attributes, whitelisting attribute values, blacklisting attributes and blacklisting attribute values, to name a few. In embodiments, the MARLENE scoring module 510 may also be configured to receive as inputs machine-learned or otherwise pre-selected weights W₁, . . . , W_(M+N), for the corresponding product recommending signals 521-1, . . . , 521-M, and attribute recommending signals 522-1, . . . , 522-N.

Based on the inputs 520 a, 520 b, the MARLENE scoring module 510 may execute the MARLENE scoring algorithm 511 to generate a set of outputs 530, which may include product order 531, facet order 532 and facet value order 533 in accordance with an exemplary embodiment of the present invention.

FIG. 3 shows a high level flow chart illustrating an exemplary search algorithm for producing ranked search results as executed by the search engine server 20 in accordance with an exemplary embodiment of the present invention. In step S01, the query interface module 210 receives a search query string from a user device 10. In step S02, the search controller module 220 queries the search index database 250 to find matches. In step S03, the faceting module 230 computes facet counts for the matches. In step S04, the scoring module 240 ranks the matches. In step S05, the scoring module 240 selects the top K matches from the matches ranked in step S04. In step S06, the MARLENE scoring module 510 may take over the process, as illustrated in greater detail in FIG. 4.

FIG. 4 shows a flow chart illustrating an exemplary scoring algorithm (e.g., MARLENE scoring algorithm 511) for computing recommendation scores based on combined signals from a plurality of recommendation systems as executed by the MARLENE server 50 in accordance with an exemplary embodiment of the present invention. In step S06 a, the MARLENE scoring module 510 receives a set of inputs 520 a, such as search phrase 523, customer record ID 524, top K matches with scores 525, and facets with counts 526. In step S06 b, the MARLENE scoring module 510 receives recommendation signals 520 b from a plurality of recommendation systems, such as a plurality of product recommending signals 521-1, . . . , 521-M and a plurality of attribute recommending signals 522-1, . . . , 522-N. In step S06 c, the MARLENE scoring module 510 computes recommendation scores (“MARLENE scores”) based on combined recommendation signals from the multiple recommendation systems using the MARLENE formula, which is further described below. In step S06 d, the MARLENE scoring module 510 selects at most R results, which contribute at least S % of the total MARLENE scores, wherein S % may be a predetermined percentage (e.g., 50%, 75%, 90%, 99%, to name a few) and R is a corresponding integer. In step S06 e, the MARLENE scoring module 510 selects at most V facet values which contribute at least F % of the total MARLENE score per facet, wherein F % may be a predetermined percentage (e.g., 50%, 75%, 90%, 99%, to name a few) and V is a corresponding integer. In step S06 f, the MARLENE scoring module 510 generates outputs 530 and transmits them to the scoring module 240 and the faceting module 230 of the search engine server 20. The outputs 530 may include K or fewer matches with new recommendation scores (e.g., product order 531), recommended facets in the order of recommendation scores (e.g., facet order 532) and recommended facet values in the order of recommendation scores (e.g., facet value order 533), in accordance with an exemplary embodiment of the present invention.
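By way of non-limiting illustration, the selection steps S06 d and S06 e may be sketched as follows. The sketch adopts one possible reading of those steps: items are taken in descending order of score until the kept items account for the predetermined share of the total score or the maximum number of items is reached. It assumes non-negative scores, and all function and parameter names are hypothetical.

```python
def top_contributors(scores, share, limit):
    """Return keys in descending score order, stopping once the kept keys
    cover `share` (0..1) of the total score or `limit` keys have been kept."""
    total = sum(scores.values())
    kept, running = [], 0.0
    for key in sorted(scores, key=scores.get, reverse=True):
        if len(kept) >= limit or (total > 0 and running >= share * total):
            break
        kept.append(key)
        running += scores[key]
    return kept


def select_results_and_facet_values(product_scores, facet_value_scores,
                                    s_share, r_limit, f_share, v_limit):
    """Apply the same rule to the results (S06 d) and, per facet, to the
    facet values (S06 e)."""
    results = top_contributors(product_scores, s_share, r_limit)
    facet_values = {facet: top_contributors(values, f_share, v_limit)
                    for facet, values in facet_value_scores.items()}
    return results, facet_values
```

Under this reading, low-scoring results and facet values that contribute little to the total score are dropped before the outputs are generated.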

Referring back to FIG. 3, in step S07, the scoring module 240 receives K or fewer matches with new recommendation scores from the MARLENE scoring module 510 and re-orders them in accordance with the MARLENE scores computed by the MARLENE scoring algorithm 511. In step S08, the faceting module 230 receives recommended facets and recommended facet values from the MARLENE scoring module 510 and re-orders them in accordance with the MARLENE scores computed by the MARLENE scoring algorithm 511. In step S09, the query interface module 210 transmits ranked search results to the user device 10.

In at least one embodiment of the present invention, a formula (“MARLENE formula”) for calculating recommendation scores may be derived as follows. Brief explanations of the mathematical terms and notations used herein are provided in the Appendix at the end of this specification.

Properties of Scoring Function

According to an exemplary embodiment of the present invention, the MARLENE formula may incorporate some or all of the following aspects of the MARLENE scoring algorithm:

(i) Attribute-recommending signals;

(ii) Attribute value-recommending signals;

(iii) Product-recommending signals;

(iv) Whitelisting attributes based on, for example, the facets selected by merchant's merchandising rule framework;

(v) Whitelisting attribute values based on, for example, the value ordering imposed by merchant's merchandising rule framework;

(vi) Blacklisting attributes; and

(vii) Blacklisting attribute values.

In accordance with an exemplary embodiment of the present invention, the MARLENE formula may be in the form of a scoring function S, which may map all of the inputs into a vector of product scores, as shown below in (1):

$S[p,1] = \mathcal{S}(P, M, Q, R, W_p, W_a)[p,n] \qquad (1)$

wherein:

P is a product matrix P[p, n] of p products, each product having a total of n possible one-hot encoded attribute values,

M is a matrix M[a, n] of one-hot encoded attribute-value to attribute assignments,

Q is a recommendation signal Q[q, r_(p)] from r_(p) product recommenders,

R is a recommendation signal R [n, r_(a)] from r_(a) attribute- and attribute-value recommenders,

W_(p) is a vector W_(p)[r_(p), 1] of the product recommender weights, and

W_(a) is a vector W_(a)[r_(a), 1] of the attribute- and attribute-value recommender weights.

Examples of inputs to a scoring function S are described below. This particular example involves p=10 products with a=5 possible attributes including: TYPE, BRAND, COLOR, MATERIAL and STYLE. In embodiments, some of these attributes may be optional.

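| Product ID | TYPE | BRAND | COLOR | MATERIAL | STYLE |
| --- | --- | --- | --- | --- | --- |
| 1 | Shoes | Nike | Black | Leather | Oxford |
| 2 | Shoes | Adidas | Red | Suede | |
| 3 | Shoes | Nike | White | Leather | Oxford |
| 4 | Pants | Adidas | Blue, Red | Denim | |
| 5 | Pants | Nike | Black | Chino | |
| 6 | Pants | Nike | White | Chino | |
| 7 | Dress | Billabong | Red, White, Blue | Silk | Maxi |
| 8 | Dress | Billabong | Black | Silk | |
| 9 | Dress | Nike | Blue | Nylon | |
| 10 | Blender | Kitchenaid | White | | |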

Across p=10 products and a=5 attributes, there may be found n=20 unique attribute values. Exemplary attribute values are one-hot encoded in the following exemplary matrix P[10, 20]:

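| Product ID | Shoes | Pants | Dress | Blender | Nike | Adidas | Billabong | Kitchenaid |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 6 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 7 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 8 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 9 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 10 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |

| Product ID | Black | White | Red | Blue | Leather | Suede |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 1 | 1 | 0 | 0 |
| 5 | 1 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 1 | 0 | 0 | 0 | 0 |
| 7 | 0 | 1 | 1 | 1 | 0 | 0 |
| 8 | 1 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 1 | 0 | 0 |
| 10 | 0 | 1 | 0 | 0 | 0 | 0 |

| Product ID | Denim | Chino | Silk | Nylon | Oxford | Maxi |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 | 0 | 0 |
| 6 | 0 | 1 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 1 | 0 | 0 | 1 |
| 8 | 0 | 0 | 1 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 1 | 0 | 0 |
| 10 | 0 | 0 | 0 | 0 | 0 | 0 |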
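The one-hot encoding above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `VALUES`, `PRODUCTS` and `one_hot_products` are hypothetical and do not appear in the specification, and the column order simply follows the example matrix P[10, 20].

```python
import numpy as np

# Attribute values in the column order used by the example matrix P[10, 20].
VALUES = ["Shoes", "Pants", "Dress", "Blender",
          "Nike", "Adidas", "Billabong", "Kitchenaid",
          "Black", "White", "Red", "Blue",
          "Leather", "Suede", "Denim", "Chino", "Silk", "Nylon",
          "Oxford", "Maxi"]

# Product ID -> attribute values carried by that product (from the example table).
PRODUCTS = {
    1: ["Shoes", "Nike", "Black", "Leather", "Oxford"],
    2: ["Shoes", "Adidas", "Red", "Suede"],
    3: ["Shoes", "Nike", "White", "Leather", "Oxford"],
    4: ["Pants", "Adidas", "Blue", "Red", "Denim"],
    5: ["Pants", "Nike", "Black", "Chino"],
    6: ["Pants", "Nike", "White", "Chino"],
    7: ["Dress", "Billabong", "Red", "White", "Blue", "Silk", "Maxi"],
    8: ["Dress", "Billabong", "Black", "Silk"],
    9: ["Dress", "Nike", "Blue", "Nylon"],
    10: ["Blender", "Kitchenaid", "White"],
}


def one_hot_products(products, values):
    """Build P[p, n]: one row per product, one column per attribute value."""
    column = {value: j for j, value in enumerate(values)}
    P = np.zeros((len(products), len(values)))
    for i, product_id in enumerate(sorted(products)):
        for value in products[product_id]:
            P[i, column[value]] = 1.0
    return P


P = one_hot_products(PRODUCTS, VALUES)
```

Summing the columns of `P` gives the number of products carrying each attribute value, which is a convenient sanity check on the encoding.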

The above attribute values may be assigned to attributes via an exemplary block assignment matrix M[5, 20] as shown below:

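| Attribute | Shoes | Pants | Dress | Blender | Nike | Adidas | Billabong | Kitchenaid |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TYPE | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| BRAND | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| COLOR | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| MATERIAL | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| STYLE | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

| Attribute | Black | White | Red | Blue | Leather | Suede |
| --- | --- | --- | --- | --- | --- | --- |
| TYPE | 0 | 0 | 0 | 0 | 0 | 0 |
| BRAND | 0 | 0 | 0 | 0 | 0 | 0 |
| COLOR | 1 | 1 | 1 | 1 | 0 | 0 |
| MATERIAL | 0 | 0 | 0 | 0 | 1 | 1 |
| STYLE | 0 | 0 | 0 | 0 | 0 | 0 |

| Attribute | Denim | Chino | Silk | Nylon | Oxford | Maxi |
| --- | --- | --- | --- | --- | --- | --- |
| TYPE | 0 | 0 | 0 | 0 | 0 | 0 |
| BRAND | 0 | 0 | 0 | 0 | 0 | 0 |
| COLOR | 0 | 0 | 0 | 0 | 0 | 0 |
| MATERIAL | 1 | 1 | 1 | 1 | 0 | 0 |
| STYLE | 0 | 0 | 0 | 0 | 1 | 1 |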
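A block assignment matrix of this kind could be generated from an attribute-to-values mapping, as in the following sketch. The names `ATTRIBUTES` and `assignment_matrix` are hypothetical, and the sketch assumes that the values of each attribute occupy contiguous columns, as they do in the example.

```python
import numpy as np

# Attribute -> its possible values, in the column order of the example.
ATTRIBUTES = {
    "TYPE": ["Shoes", "Pants", "Dress", "Blender"],
    "BRAND": ["Nike", "Adidas", "Billabong", "Kitchenaid"],
    "COLOR": ["Black", "White", "Red", "Blue"],
    "MATERIAL": ["Leather", "Suede", "Denim", "Chino", "Silk", "Nylon"],
    "STYLE": ["Oxford", "Maxi"],
}


def assignment_matrix(attributes):
    """Build M[a, n] with M(i, j) = 1 when value j belongs to attribute i."""
    values = [v for vals in attributes.values() for v in vals]
    M = np.zeros((len(attributes), len(values)))
    start = 0
    for a, vals in enumerate(attributes.values()):
        M[a, start:start + len(vals)] = 1.0  # contiguous block for this attribute
        start += len(vals)
    return M, values


M, VALUES = assignment_matrix(ATTRIBUTES)
```

Each column of `M` sums to one because every attribute value belongs to exactly one attribute.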

In this example, the current page may be rendered for a customer who received Attribute Recommendation Signals from, for example, three available Attribute Recommenders r_(a)(1), r_(a)(2) and r_(a)(3).

r_(a)(1) may be, for example, an attribute recommender rather than attribute value recommender, and recommends, for example, COLOR and STYLE attributes as important for the customer. To achieve this, this attribute recommender may produce, for example, the same recommendation score for all values of COLOR and the same recommendation score for all values of STYLE.

r_(a)(2) may be, for example, a facet value count recommender. Its recommendation scores may be calculated based on an assumption that all 10 products are in the result set.

r_(a)(3) may be, for example, a brand propensity recommender. It may assign a dollar amount to each BRAND, predicting the customer's spend in the next few months.

With the above assumptions, the recommendation signal matrix R[20, 3] may look like the following:

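| Attribute Value | r_a(1) | r_a(2) | r_a(3) |
| --- | --- | --- | --- |
| Shoes | 0 | 3 | 0 |
| Pants | 0 | 3 | 0 |
| Dress | 0 | 3 | 0 |
| Blender | 0 | 1 | 0 |
| Nike | 0 | 5 | 50 |
| Adidas | 0 | 2 | 400 |
| Billabong | 0 | 2 | 200 |
| Kitchenaid | 0 | 1 | 500 |
| Black | 50 | 3 | 0 |
| White | 50 | 4 | 0 |
| Red | 50 | 3 | 0 |
| Blue | 50 | 3 | 0 |
| Leather | 0 | 2 | 0 |
| Suede | 0 | 1 | 0 |
| Denim | 0 | 1 | 0 |
| Chino | 0 | 2 | 0 |
| Silk | 0 | 2 | 0 |
| Nylon | 0 | 1 | 0 |
| Oxford | 100 | 2 | 0 |
| Maxi | 100 | 1 | 0 |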

Similarly, assuming that the customer received Product Recommendation Signals from, for example, two available Product Recommenders, r_(p)(1) and r_(p)(2), the recommendation signal matrix Q[10, 2] may look like the following:

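| Product ID | r_p(1) | r_p(2) |
| --- | --- | --- |
| 1 | 0 | 0 |
| 2 | 0 | 10000 |
| 3 | 0 | 0 |
| 4 | 0 | 0 |
| 5 | 5 | 0 |
| 6 | 4 | 0 |
| 7 | 3 | 0 |
| 8 | 2 | 5000 |
| 9 | 1 | 0 |
| 10 | 0 | 0 |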

While the weights of attribute and product recommenders W_(a) and W_(p) can be machine-learned, it is assumed that they are all preset to 1.0 in this example.

In embodiments, machine learning of the signal weight vectors W_(p) and W_(a) may be implemented in the scoring function S. To support machine learning of the weight vectors, the scoring function S may require additional properties.

For example, as shown below in (2a) and (2b), there may be found two different ways of expressing the scoring function $\mathcal{S}$ via other functions, using only part of the inputs each:

$\mathcal{S} = \mathcal{C}_1(\mathcal{S}_p(Q, W_p)[p,1],\ \mathcal{S}_{a1}(P, M, R, W_a)[p,1]) \qquad (2a)$

$\mathcal{S} = \mathcal{C}_2(\mathcal{S}_p(Q, W_p)[p,1],\ \mathcal{S}_{a2}(P, M, R, W_a)[n,1]) \qquad (2b)$

The benefit of such representation of the scoring function would be that $\mathcal{S}_p$ would use only product recommendation signals, and $\mathcal{S}_a$ would use only attribute recommendation signals.

The difference between $\mathcal{S}_{a1}$ and $\mathcal{S}_{a2}$ is that product signal is produced by using attribute recommendations in $\mathcal{S}_{a1}$, while attribute value signal is produced by using attribute recommendations in $\mathcal{S}_{a2}$.

In embodiments, there may be found further functions ϕ_(p), ϕ_(a1), and ϕ_(a2), which are independent from the signal weights W_(p) and W_(a), such that:

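$\phi_p[p, r_p] \Rightarrow \mathcal{S}_p(Q, W_p) = \mathcal{S}_p(\phi_p(Q), W_p)[p,1] \qquad (3a)$

$\phi_{a1}[p, r_a] \Rightarrow \mathcal{S}_{a1}(P, M, R, W_a) = \mathcal{S}_{a1}(\phi_{a1}(P, M, R), W_a)[p,1] \qquad (3b)$

$\phi_{a2}[n, r_a] \Rightarrow \mathcal{S}_{a2}(P, M, R, W_a) = \mathcal{S}_{a2}(\phi_{a2}(P, M, R), W_a)[n,1] \qquad (3c)$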

If product scores can be expressed in this way, then ϕ_(p), ϕ_(a1), and ϕ_(a2) can be used as feature vectors in machine learning algorithms in order to learn weights W_(p) and W_(a). In embodiments, these feature vectors may be returned from the MARLENE scoring module 510 with its application programming interface (API) response, and tagged onto products and facet values. In such a case, the difference between ϕ_(a1)[p, r_(a)] and ϕ_(a2)[n, r_(a)] is that the former may be used with a product-based feedback signal (e.g., position on the page of a viewed product), while the latter may be used with facet-based feedback signal (e.g., position on the page of a clicked facet value).

As shown below in (4), an attribute value score vector V[n, 1] can be expressed in terms of $\mathcal{S}_{a2}$, which directly scores all attribute values:

$V[n,1] = \mathcal{S}_{a2}(\phi_{a2}, W_a)[n,1] \qquad (4)$

Given the attribute value scores V[n, 1], there may be found a function $\mathcal{A}$ for scoring the attributes themselves. Given a attributes with n possible values across all a, the function $\mathcal{A}$ may be expressed as:

$\mathcal{A}(V) = A[a,1] \qquad (5)$

where A[a, 1] is the vector of attribute scores.

Derivation of Scoring Function

Product Recommendation Signals:

Product recommendation signals may assign a score to every product, so their output may be a matrix of scores. Given r_(p) product recommenders with weights W_(p)[r_(p), 1], there may be Q[q, r_(p)] product scores. In general, product recommenders provide signal for a set q which may be different from set p. In such a case, only the relevant recommendations need to be extracted.

In embodiments, two vectors, Z[p, 1] and Z[q, 1], may contain product IDs in the corresponding vector positions as shown below in (6a) and (6b):

$\begin{matrix} {{Z\left\lbrack {p,1} \right\rbrack} = {\overset{p}{\bigcup\limits_{i = 1}}{{product\_ id}(i)}}} & \left( {{6a},{6b}} \right) \\ {{Z\left\lbrack {q,1} \right\rbrack} = {\overset{q}{\bigcup\limits_{i = 1}}{{product\_ id}(i)}}} & \; \end{matrix}$

As shown below in (7), a left join operation can be performed on the first column to get the scores only for the products from the set p:

$Q[p, r_p + 1] = Z[p,1] ⟕_1 (Z[q,1] \cup Q[q, r_p]) \qquad (7)$

where a “left join” of matrix A to matrix B on a column j is defined as a mathematical operation for creating a new matrix A ⟕ B with all columns from A and from B, where columns borrowed from B in those rows where A(i,j)=B(i,j) have the values from B, and zero in the rest of the rows, as shown below in (8):

$A[a,b] ⟕ B[c,d] = C[a, b+d-1]: \begin{cases} C(i, b+\ldots) = B(i,x) & \forall i: \exists B(i,j): A(i,j) = B(i,j),\ x \neq j \\ C(i, b+\ldots) = 0.0 & \forall i: \nexists B(i,j): A(i,j) = B(i,j) \end{cases} \qquad (8)$

After the left join operation is performed, as shown below in (9), the first column of product IDs can be dropped since it is no longer necessary:

$\begin{matrix} {{Q\left\lbrack {p,r_{p}} \right\rbrack} = {\overset{r_{p}}{\bigcup\limits_{i = 2}}{\left( {{Q\left\lbrack {p,{r_{p} + 1}} \right\rbrack}\left( {p,i} \right)} \right)\left\lbrack {p,1} \right\rbrack}}} & (9) \end{matrix}$

Finally, to ensure that the signals from different product recommenders are commensurate, the matrix Q[p, r_(p)] may be normalized column-wise as shown below in (10):

$\hat{Q}[p, \hat{r}_p] \qquad (10)$

In addition, for ease of use, the vector of product recommender weights can also be normalized such that all resulting product scores are in the range [0, 1].

In embodiments, a vector defined by a product of the normalized product score matrix and the normalized product recommender weights vector, $B[p,1] = \hat{Q}[p, \hat{r}_p] \cdot \hat{W}_p[\hat{r}_p, 1]$, can be used as a multiplicative boost vector in the definition of S[p, 1].

Referring back to the sample input data used in the example described above, assuming that L1-normalization is used in all examples, $\hat{Q}[p, \hat{r}_p]$ based on the sample input data may look like the following:

| Product ID | $\hat{r}_p(1)$ | $\hat{r}_p(2)$ |
| --- | --- | --- |
| 1 | 0 | 0 |
| 2 | 0 | 0.6667 |
| 3 | 0 | 0 |
| 4 | 0 | 0 |
| 5 | 0.3333 | 0 |
| 6 | 0.2667 | 0 |
| 7 | 0.2 | 0 |
| 8 | 0.1333 | 0.3333 |
| 9 | 0.0667 | 0 |
| 10 | 0 | 0 |

Given the above-mentioned assumption that the weights of attribute and product recommenders W_(a) and W_(p) are preset to 1.0 for all recommenders, the normalized vector of product recommender weights is $\hat{W}_p[\hat{r}_p, 1] = (0.5, 0.5)$, and B[p, 1] becomes:

| Product ID | Calculation | B |
| --- | --- | --- |
| 1 | 0 × 0.5 + 0 × 0.5 | 0 |
| 2 | 0 × 0.5 + 0.6667 × 0.5 | 0.3333 |
| 3 | 0 × 0.5 + 0 × 0.5 | 0 |
| 4 | 0 × 0.5 + 0 × 0.5 | 0 |
| 5 | 0.3333 × 0.5 + 0 × 0.5 | 0.1667 |
| 6 | 0.2667 × 0.5 + 0 × 0.5 | 0.1333 |
| 7 | 0.2 × 0.5 + 0 × 0.5 | 0.1 |
| 8 | 0.1333 × 0.5 + 0.3333 × 0.5 | 0.2333 |
| 9 | 0.0667 × 0.5 + 0 × 0.5 | 0.0333 |
| 10 | 0 × 0.5 + 0 × 0.5 | 0 |
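The boost vector B can be reproduced numerically from the sample matrix Q. The following is a minimal sketch assuming L1-normalization, as the example states; the helper name `l1_normalize_columns` is hypothetical.

```python
import numpy as np


def l1_normalize_columns(X):
    """Divide every column by its L1 norm; all-zero columns stay zero."""
    X = np.asarray(X, dtype=float)
    sums = np.abs(X).sum(axis=0, keepdims=True)
    sums[sums == 0.0] = 1.0
    return X / sums


# Sample product recommendation signals Q[10, 2] for products 1..10.
Q = np.array([[0, 0], [0, 10000], [0, 0], [0, 0], [5, 0],
              [4, 0], [3, 0], [2, 5000], [1, 0], [0, 0]], dtype=float)
W_p = np.array([1.0, 1.0])           # preset product recommender weights

Q_hat = l1_normalize_columns(Q)      # normalization per (10)
W_p_hat = W_p / W_p.sum()            # normalized weights (0.5, 0.5)
B = Q_hat @ W_p_hat                  # boost vector B[10, 1]
```

Normalizing each column separately keeps the second recommender's large raw scores (10000 and 5000) from dominating the first recommender's single-digit scores.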

Attribute- and Attribute-Value Recommendation Signals

Attribute recommendation signals produce recommendations for whole attributes. Such recommendations do not depend on the number of attribute value hits in a product, as long as there is at least one value hit.

Each attribute recommender produces a single vector of length n where all possible values of each recommended attribute are set to the same recommendation score of the corresponding attribute.

Given r_(a) attribute recommenders, their output can be represented as a matrix R[n, r_(a)], and the scores of the products based on the attribute recommendations can be calculated as follows:

S[p,r _(a)]=P[p,n]·R[n,r _(a)]  (11)

Attribute-value recommendation signals produce recommendations for specific attribute values, so not every element in the attribute value vectors has the same value as in the case of attribute recommenders. But the approach can be the same as attribute recommendation signals, as long as the recommendation signals are normalized to make them commensurate as shown below in (12):

$S[p, r_a] = P[p,n] \cdot \hat{R}[n, \hat{r}_a] \qquad (12)$

In view of the foregoing, there may be no need to make a distinction between attribute value recommenders and attribute recommenders. The latter can be simply considered to recommend every value of the recommended attribute with the same score.

In addition, the quality of attribute-based product scoring may be enhanced if product recommendation scores are used as a multiplicative boost for the attribute recommendations as shown below in (13):

$S[p, r_a] = (P[p,n] \cdot \hat{R}[n, \hat{r}_a]) \times B[p,1] \qquad (13)$

In this approach, the product recommendation signal may be lost for those products which do not have any attribute value recommendation hits (for such products, e.g., S(i)=0[1, r_(a)]), and the attribute recommendation signal may be lost for those products which do not have any product recommendation hits (for such products, e.g., B(i)=0.0). To rectify this problem, in embodiments, a very small value ϵ may be added to each element of the product matrix P, and to each element of the boost vector B, so that S(i)=(ϵ*B(i))[1, r_(a)] and B(i)=ϵ instead of 0.0. This can be accomplished by initializing a matrix ε[p,n] and a vector ε[p, 1] with the same value ϵ in all their elements, as shown below in (14):

$S[p, r_a] = ((P[p,n] + \varepsilon[p,n]) \cdot \hat{R}[n, \hat{r}_a]) \times (B[p,1] + \varepsilon[p,1]) \qquad (14)$

The small value ϵ must be smaller than any non-zero recommendation signal but not so small that powers of ϵ would cause an underflow.

Whitelisting and Blacklisting Attributes and Attribute Values

Whitelisting of attributes can be expressed as a set of multiplicative scores over attribute values, where all allowed values have score 1.0, and all disallowed values have score 0.0. If several white-lists are provided, they can be combined via element-wise multiplication as shown below in (15):

$\begin{matrix} {{L\left\lbrack {n,1} \right\rbrack} = {\prod\limits_{w}\; {L_{w}\left\lbrack {n,1} \right\rbrack}}} & (15) \end{matrix}$

As shown below in (16), blacklisting can be implemented through additional vectors L_(b)[n,1], where individual blacklisted values are assigned the score 0.0, and all values of the blacklisted attributes are assigned score 0.0:

$\begin{matrix} {{L\left\lbrack {n,1} \right\rbrack} = {\prod\limits_{w}\; {{L_{w}\left\lbrack {n,1} \right\rbrack} \times {\prod\limits_{b}\; {L_{b}\left\lbrack {n,1} \right\rbrack}}}}} & (16) \end{matrix}$

As shown below in (17), the multiplicative scores vector can be applied via column-wise product with attribute value scores, before column normalization:

$S[p, r_a] = ((P[p,n] + \varepsilon[p,n]) \cdot (R \times L)[n, \hat{r}_a]) \times (B[p,1] + \varepsilon[p,1]) \qquad (17)$

To prevent product attributes with several values from having more influence than product attributes with only one value, the matrix P and attribute value recommendations may be block averaged (e.g., only after applying the white-lists) as shown below in (18):

$S[p, r_a] = (((P \mid M)[p,n] + \varepsilon[p,n]) \cdot (R \times L \mid M)[n, \hat{r}_a]) \times (B[p,1] + \varepsilon[p,1]) \qquad (18)$
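Block averaging can be sketched as dividing each entry by the sum of its attribute block. This is an assumption about how the (X|M) operation is realized, chosen because it reproduces the 0.5 and 0.3333 entries that multi-valued attributes receive in the worked example; the function name `block_average` and the three-value sample data are hypothetical.

```python
import numpy as np


def block_average(X, M):
    """Divide each entry of X[., n] by the sum of its attribute block in the same row."""
    X = np.asarray(X, dtype=float)
    block_sums = X @ M.T @ M      # per-row block sums, broadcast back to n columns
    out = np.zeros_like(X)
    np.divide(X, block_sums, out=out, where=block_sums > 0)
    return out


# Two attributes over three values: COLOR = {Red, Blue}, MATERIAL = {Denim}.
M = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
P = np.array([[1.0, 1.0, 1.0],    # product with Red and Blue, and Denim
              [1.0, 0.0, 0.0]])   # product with Red only, no material
P_avg = block_average(P, M)

# The same helper applied to one recommender column, transposed to a row.
R = np.array([[3.0], [1.0], [2.0]])
R_avg = block_average(R.T, M).T
```

A product with two colors thus contributes 0.5 per color, so it carries the same total COLOR weight as a single-color product.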

But this solution (18) may still lose product recommendation signal B(i) for those products which do not have any hits in the white-listed attribute recommendations. In order to still be able to rank such products based on B(i), another small value ε[n, r_(a)] may be added to the attribute value recommendations. To ensure that blacklisted attribute values remain blacklisted (i.e., to preserve the blacklist zeros), it may be multiplied by L, as shown below in (19):

$S[p, r_a] = (((P \mid M)[p,n] + \varepsilon[p,n]) \cdot ((R \times L \mid M)[n, \hat{r}_a] + (\varepsilon \times L))) \times (B[p,1] + \varepsilon[p,1]) \qquad (19)$

In the actual code implementation, it may be faster to adjust a matrix by replacing all zeros with ϵ than to initialize and add the ε-matrix. Such adjusted matrices are denoted herein with the “ϵ” subscript.

Based on the observation that:

L×L=L⇒(X×L)+(ε×L)=(X×L×L)+(ε×L)=(X×L+ε)×L≈(X×L)_(ε) ×L

the following formula may be obtained:

$S[p, r_a] = ((P \mid M)_\epsilon[p,n] \cdot ((R \times L \mid M)_\epsilon[n, \hat{r}_a] \times L)) \times B_\epsilon[p,1] \qquad (20)$

Total Product Scoring

For brevity, a name U may be assigned to the adjusted attribute value recommendations matrix:

$U[n, r_a] = (R \times L \mid M)_\epsilon[n, \hat{r}_a] \times L[n,1] \qquad (21)$

Total product scores may be calculated via dot-multiplication with the attribute recommender weights. All elements of the dot-product with U[n, r_(a)] belong to the range (0, 1] and all elements of B_(ϵ)[p,1] belong in the range (0, 1]. Therefore, all elements of S[p, r_(a)] belong to the range (0, 1] as well. To keep the final product scores in the same range, W_(a) may be normalized as shown in (22):

$S[p,1] = (((P \mid M)_\epsilon[p,n] \cdot U[n, r_a]) \times B_\epsilon[p,1]) \cdot \hat{W}_a[\hat{r}_a, 1] \qquad (22)$

After re-arranging the terms in (22), the scoring function $\mathcal{S}$ may be expressed as the formula shown below in (23):

$S[p,1] = B_\epsilon[p,1] \times ((P \mid M)_\epsilon[p,n] \cdot U[n, r_a] \cdot \hat{W}_a[\hat{r}_a, 1]) \qquad (23)$

This formula (23) for the scoring function does not lose signals when a product has no attribute recommendation hits, or when a product has no product recommendation hits.

Referring back to the example based on the sample data described above, assuming that ϵ=10⁻¹², the epsilon-adjusted B_(ϵ)[p,1] becomes:

| Product ID | B_ϵ |
| --- | --- |
| 1 | 10⁻¹² |
| 2 | 0.3333 |
| 3 | 10⁻¹² |
| 4 | 10⁻¹² |
| 5 | 0.1667 |
| 6 | 0.1333 |
| 7 | 0.1 |
| 8 | 0.2333 |
| 9 | 0.0333 |
| 10 | 10⁻¹² |

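Block-averaging and epsilon-adjusting the product matrix yields the following (P|M)_(ϵ)[10,20]:

| Product ID | Shoes | Pants | Dress | Blender | Nike | Adidas | Billabong | Kitchenaid |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 10⁻¹² | 10⁻¹² | 10⁻¹² | 1 | 10⁻¹² | 10⁻¹² | 10⁻¹² |
| 2 | 1 | 10⁻¹² | 10⁻¹² | 10⁻¹² | 10⁻¹² | 1 | 10⁻¹² | 10⁻¹² |
| 3 | 1 | 10⁻¹² | 10⁻¹² | 10⁻¹² | 1 | 10⁻¹² | 10⁻¹² | 10⁻¹² |
| 4 | 10⁻¹² | 1 | 10⁻¹² | 10⁻¹² | 10⁻¹² | 1 | 10⁻¹² | 10⁻¹² |
| 5 | 10⁻¹² | 1 | 10⁻¹² | 10⁻¹² | 1 | 10⁻¹² | 10⁻¹² | 10⁻¹² |
| 6 | 10⁻¹² | 1 | 10⁻¹² | 10⁻¹² | 1 | 10⁻¹² | 10⁻¹² | 10⁻¹² |
| 7 | 10⁻¹² | 10⁻¹² | 1 | 10⁻¹² | 10⁻¹² | 10⁻¹² | 1 | 10⁻¹² |
| 8 | 10⁻¹² | 10⁻¹² | 1 | 10⁻¹² | 10⁻¹² | 10⁻¹² | 1 | 10⁻¹² |
| 9 | 10⁻¹² | 10⁻¹² | 1 | 10⁻¹² | 1 | 10⁻¹² | 10⁻¹² | 10⁻¹² |
| 10 | 10⁻¹² | 10⁻¹² | 10⁻¹² | 1 | 10⁻¹² | 10⁻¹² | 10⁻¹² | 1 |

| Product ID | Black | White | Red | Blue | Leather | Suede |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 10⁻¹² | 10⁻¹² | 10⁻¹² | 1 | 10⁻¹² |
| 2 | 10⁻¹² | 10⁻¹² | 1 | 10⁻¹² | 10⁻¹² | 1 |
| 3 | 10⁻¹² | 1 | 10⁻¹² | 10⁻¹² | 1 | 10⁻¹² |
| 4 | 10⁻¹² | 10⁻¹² | 0.5 | 0.5 | 10⁻¹² | 10⁻¹² |
| 5 | 1 | 10⁻¹² | 10⁻¹² | 10⁻¹² | 10⁻¹² | 10⁻¹² |
| 6 | 10⁻¹² | 1 | 10⁻¹² | 10⁻¹² | 10⁻¹² | 10⁻¹² |
| 7 | 10⁻¹² | 0.3333 | 0.3333 | 0.3333 | 10⁻¹² | 10⁻¹² |
| 8 | 1 | 10⁻¹² | 10⁻¹² | 10⁻¹² | 10⁻¹² | 10⁻¹² |
| 9 | 10⁻¹² | 10⁻¹² | 10⁻¹² | 1 | 10⁻¹² | 10⁻¹² |
| 10 | 10⁻¹² | 1 | 10⁻¹² | 10⁻¹² | 10⁻¹² | 10⁻¹² |

| Product ID | Denim | Chino | Silk | Nylon | Oxford | Maxi |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 10⁻¹² | 10⁻¹² | 10⁻¹² | 10⁻¹² | 1 | 10⁻¹² |
| 2 | 10⁻¹² | 10⁻¹² | 10⁻¹² | 10⁻¹² | 10⁻¹² | 10⁻¹² |
| 3 | 10⁻¹² | 10⁻¹² | 10⁻¹² | 10⁻¹² | 1 | 10⁻¹² |
| 4 | 1 | 10⁻¹² | 10⁻¹² | 10⁻¹² | 10⁻¹² | 10⁻¹² |
| 5 | 10⁻¹² | 1 | 10⁻¹² | 10⁻¹² | 10⁻¹² | 10⁻¹² |
| 6 | 10⁻¹² | 1 | 10⁻¹² | 10⁻¹² | 10⁻¹² | 10⁻¹² |
| 7 | 10⁻¹² | 10⁻¹² | 1 | 10⁻¹² | 10⁻¹² | 1 |
| 8 | 10⁻¹² | 10⁻¹² | 1 | 10⁻¹² | 10⁻¹² | 10⁻¹² |
| 9 | 10⁻¹² | 10⁻¹² | 10⁻¹² | 1 | 10⁻¹² | 10⁻¹² |
| 10 | 10⁻¹² | 10⁻¹² | 10⁻¹² | 10⁻¹² | 10⁻¹² | 10⁻¹² |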

For example, let's assume that for some business reason, it is not permitted to consider BRAND=Adidas in the recommendations. That means that this attribute value is black-listed, so the whitelist L of attribute values may look like the following:

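| Attribute Value | L |
| --- | --- |
| Shoes | 1 |
| Pants | 1 |
| Dress | 1 |
| Blender | 1 |
| Nike | 1 |
| Adidas | 0 |
| Billabong | 1 |
| Kitchenaid | 1 |
| Black | 1 |
| White | 1 |
| Red | 1 |
| Blue | 1 |
| Leather | 1 |
| Suede | 1 |
| Denim | 1 |
| Chino | 1 |
| Silk | 1 |
| Nylon | 1 |
| Oxford | 1 |
| Maxi | 1 |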

Now the epsilon-adjusted, white-listed, normalized and block-averaged attribute recommendations U[20, 3] may be calculated as follows:

| Attribute Value | r_a(1) × L | (r_a(1) × L)/M | U(1) | r_a(2) × L | (r_a(2) × L)/M | U(2) | r_a(3) × L | (r_a(3) × L)/M | U(3) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Shoes | 0 | 0 | 10⁻¹² | 3 | 0.3 | 0.06 | 0 | 0 | 10⁻¹² |
| Pants | 0 | 0 | 10⁻¹² | 3 | 0.3 | 0.06 | 0 | 0 | 10⁻¹² |
| Dress | 0 | 0 | 10⁻¹² | 3 | 0.3 | 0.06 | 0 | 0 | 10⁻¹² |
| Blender | 0 | 0 | 10⁻¹² | 1 | 0.1 | 0.02 | 0 | 0 | 10⁻¹² |
| Nike | 0 | 0 | 10⁻¹² | 5 | 0.625 | 0.125 | 50 | 0.0667 | 0.0667 |
| Adidas | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Billabong | 0 | 0 | 10⁻¹² | 2 | 0.25 | 0.05 | 200 | 0.2667 | 0.2667 |
| Kitchenaid | 0 | 0 | 10⁻¹² | 1 | 0.125 | 0.025 | 500 | 0.6667 | 0.6667 |
| Black | 50 | 0.25 | 0.125 | 3 | 0.2308 | 0.0462 | 0 | 0 | 10⁻¹² |
| White | 50 | 0.25 | 0.125 | 4 | 0.3077 | 0.0615 | 0 | 0 | 10⁻¹² |
| Red | 50 | 0.25 | 0.125 | 3 | 0.2308 | 0.0462 | 0 | 0 | 10⁻¹² |
| Blue | 50 | 0.25 | 0.125 | 3 | 0.2308 | 0.0462 | 0 | 0 | 10⁻¹² |
| Leather | 0 | 0 | 10⁻¹² | 2 | 0.2222 | 0.0444 | 0 | 0 | 10⁻¹² |
| Suede | 0 | 0 | 10⁻¹² | 1 | 0.1111 | 0.0222 | 0 | 0 | 10⁻¹² |
| Denim | 0 | 0 | 10⁻¹² | 1 | 0.1111 | 0.0222 | 0 | 0 | 10⁻¹² |
| Chino | 0 | 0 | 10⁻¹² | 2 | 0.2222 | 0.0444 | 0 | 0 | 10⁻¹² |
| Silk | 0 | 0 | 10⁻¹² | 2 | 0.2222 | 0.0444 | 0 | 0 | 10⁻¹² |
| Nylon | 0 | 0 | 10⁻¹² | 1 | 0.1111 | 0.0222 | 0 | 0 | 10⁻¹² |
| Oxford | 100 | 0.5 | 0.25 | 2 | 0.6667 | 0.1333 | 0 | 0 | 10⁻¹² |
| Maxi | 100 | 0.5 | 0.25 | 1 | 0.3333 | 0.0667 | 0 | 0 | 10⁻¹² |

Given the assumption that the preset weights are equal to 1.0 for all recommenders, the normalized vector of attribute recommender weights is $\hat{W}_a[\hat{r}_a, 1] = (0.3333, 0.3333, 0.3333)$, and the attribute value recommendation vector becomes:

| Attribute Value | $U \cdot \hat{W}_a$ |
| --- | --- |
| Shoes | 0.02 |
| Pants | 0.02 |
| Dress | 0.02 |
| Blender | 0.0067 |
| Nike | 0.0639 |
| Adidas | 0 |
| Billabong | 0.1056 |
| Kitchenaid | 0.2306 |
| Black | 0.0571 |
| White | 0.0622 |
| Red | 0.0571 |
| Blue | 0.0571 |
| Leather | 0.0148 |
| Suede | 0.0074 |
| Denim | 0.0074 |
| Chino | 0.0148 |
| Silk | 0.0148 |
| Nylon | 0.0074 |
| Oxford | 0.1278 |
| Maxi | 0.1056 |

S[p, 1] can be calculated from the above tables for B_(ϵ)[10,1], (P|M)_(ϵ)[10,20] and $(U \cdot \hat{W}_a)[20,1]$:

| Product | $(P \mid M)_\epsilon \cdot U \cdot \hat{W}_a$ | S |
| --- | --- | --- |
| 1 | 0.2835 | 0.2835 × 10⁻¹² |
| 2 | 0.0845 | 0.0282 |
| 3 | 0.2886 | 0.2886 × 10⁻¹² |
| 4 | 0.0845 | 0.0845 × 10⁻¹² |
| 5 | 0.1558 | 0.0260 |
| 6 | 0.1609 | 0.0214 |
| 7 | 0.3047 | 0.0305 |
| 8 | 0.1974 | 0.0461 |
| 9 | 0.1484 | 0.0049 |
| 10 | 0.2994 | 0.2994 × 10⁻¹² |
| Total | | 0.1570 |

From the above-calculated recommendation scores, the following product ranking may be obtained:

| Product ID | Rank |
| --- | --- |
| 1 | 9 |
| 2 | 3 |
| 3 | 8 |
| 4 | 10 |
| 5 | 4 |
| 6 | 5 |
| 7 | 2 |
| 8 | 1 |
| 9 | 6 |
| 10 | 7 |
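The worked example above can be reproduced end to end with a short NumPy sketch of formula (23). It is an illustration under stated assumptions, not the specification's implementation: helper names such as `eps_adjust`, `l1_columns` and `block_average` are hypothetical, block averaging is taken to mean division by the block sum, and L1-normalization is used throughout, as in the example.

```python
import numpy as np

EPS = 1e-12
BLOCK_SIZES = [4, 4, 4, 6, 2]  # TYPE, BRAND, COLOR, MATERIAL, STYLE
# 0-based column indices of the attribute values carried by products 1..10.
HITS = [[0, 4, 8, 12, 18], [0, 5, 10, 13], [0, 4, 9, 12, 18], [1, 5, 10, 11, 14],
        [1, 4, 8, 15], [1, 4, 9, 15], [2, 6, 9, 10, 11, 16, 19], [2, 6, 8, 16],
        [2, 4, 11, 17], [3, 7, 9]]

P = np.zeros((10, 20))                      # one-hot product matrix
for i, cols in enumerate(HITS):
    P[i, cols] = 1.0
M = np.zeros((5, 20))                       # block assignment matrix
start = 0
for a, size in enumerate(BLOCK_SIZES):
    M[a, start:start + size] = 1.0
    start += size

R = np.zeros((20, 3))
R[8:12, 0], R[18:20, 0] = 50, 100           # r_a(1): COLOR and STYLE
R[:, 1] = [3, 3, 3, 1, 5, 2, 2, 1, 3, 4, 3, 3, 2, 1, 1, 2, 2, 1, 2, 1]  # r_a(2)
R[4:8, 2] = [50, 400, 200, 500]             # r_a(3): brand propensity
Q = np.zeros((10, 2))
Q[4:9, 0] = [5, 4, 3, 2, 1]                 # r_p(1)
Q[[1, 7], 1] = [10000, 5000]                # r_p(2)
L = np.ones(20)
L[5] = 0.0                                  # BRAND=Adidas is blacklisted
W_p, W_a = np.ones(2), np.ones(3)           # preset weights


def eps_adjust(X):
    """Replace exact zeros with EPS."""
    return np.where(X == 0.0, EPS, X)


def l1_columns(X):
    """L1-normalize each column; all-zero columns stay zero."""
    sums = X.sum(axis=0, keepdims=True)
    return X / np.where(sums == 0.0, 1.0, sums)


def block_average(X, M):
    """Divide each entry by the sum of its attribute block in the same row."""
    sums = X @ M.T @ M
    return np.divide(X, sums, out=np.zeros_like(X), where=sums > 0)


RL = R * L[:, None]                                                    # apply lists
U = eps_adjust(l1_columns(block_average(RL.T, M).T)) * L[:, None]      # (21)
B_eps = eps_adjust(l1_columns(Q) @ (W_p / W_p.sum()))                  # boost vector
S = B_eps * (eps_adjust(block_average(P, M)) @ U @ (W_a / W_a.sum()))  # (23)
ranking = [int(i) + 1 for i in np.argsort(-S)]                         # best first
```

Under these assumptions `ranking` lists product IDs from best to worst, and products without any product recommendation signal are still ordered among themselves by their attribute scores scaled by ϵ.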

In embodiments, it may sometimes be necessary to boost only “high-confidence” recommendations, instead of re-ranking the entire top-K results based on which the recommended ranks were computed. In such a case, not more than X products, which contribute not more than Y % of the total recommendation score, may be chosen. For example, if Y=80% and X=3, then one needs to select not more than three top-scoring products which contribute approximately 0.157×0.8=0.1256 to the total recommendation score. Since three top-ranking products contribute 0.1048 and four top-ranking products contribute 0.1308, the 80% boundary lies somewhere between rank 3 and rank 4. In this example, products with IDs: 8, 7 and 2 may be selected.
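The selection of high-confidence recommendations described above can be sketched as a cumulative cut-off over the sorted scores. The function name `select_high_confidence` and its parameters are hypothetical, and the rule used here (stop before the cumulative share would exceed Y%) is one reading of the paragraph above.

```python
def select_high_confidence(scores, max_items, max_share):
    """Pick top-scoring IDs while their cumulative share of the total stays within max_share."""
    total = sum(scores.values())
    chosen, running = [], 0.0
    for product_id, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if len(chosen) >= max_items or running + score > max_share * total:
            break
        chosen.append(product_id)
        running += score
    return chosen


# Recommendation scores S from the worked example (product ID -> score).
scores = {1: 0.2835e-12, 2: 0.0282, 3: 0.2886e-12, 4: 0.0845e-12, 5: 0.0260,
          6: 0.0214, 7: 0.0305, 8: 0.0461, 9: 0.0049, 10: 0.2994e-12}
selected = select_high_confidence(scores, max_items=3, max_share=0.8)
```

With X=3 and Y=80% the sketch returns products 8, 7 and 2, matching the example; the same three products are returned even when X is raised, because adding the fourth product would cross the 80% boundary.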

Machine Learning Formulation

Machine learning methods may be used to train one or more models to approximate the above-described signal weights W_(a) and W_(p). For example, pointwise, pairwise, and listwise methods—or the like—may be used for training ranking models based on input features per product per session, and the customer behavior associated with that session, such as product views, adding products to the shopping bag (ATB), and purchasing the products. Such ranking models have been used to build upon internal product scoring models. Product scores obtained from the internal scoring models are then used as inputs into the ranking objective function. For example, in a pairwise ranking model there may be an internal product scoring model, producing internal scores for the two products in a pair: a positive example, selected by a customer, and a negative example, not selected by the same customer. The objective of the ranking model is to maximize the score difference between the positive and negative examples.

Training a ranking model may update the internal scoring model as well.

The scores output by the scoring model may also be directly used for ranking products by sorting the products in the descending order of their scores. In this case the scoring model and the ranking model are the same model. It is trained to predict the score of any product given its features, i.e., it is trained pointwise.

Accordingly, a method is described for computing product features from the product and attribute recommendation signals, and two methods of computing recommendation signal weights assuming pointwise training for direct score-based ranking.

Product Feedback Based Scoring Model

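Given the formula (2a) for S[p, 1], definitions for $\mathcal{C}_1$, $\mathcal{S}_p$, and $\mathcal{S}_{a1}$ can be obtained by the following substitutions:

$\mathcal{C}_1(f, g) = f \times g \qquad (24a)$

$\mathcal{S}_p(\phi_p(Q), W_p)[p,1] = B_\epsilon[p,1] = \hat{Q}_\epsilon[p, \hat{r}_p] \cdot \hat{W}_p[\hat{r}_p, 1] \qquad (24b)$

$\mathcal{S}_{a1}(\phi_{a1}(P, M, R), W_a)[p,1] = (P \mid M)_\epsilon[p,n] \cdot U[n, r_a] \cdot \hat{W}_a[\hat{r}_a, 1] \qquad (24c)$

wherein $\mathcal{C}_1$ is composition via element-wise multiplication.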

As shown below in (25a) and (25b), the feature matrices for all products in P are the parts of $\mathcal{S}_p$ and $\mathcal{S}_{a1}$ which are independent of the signal weights:

$\phi_p[p, r_p] = \hat{Q}[p, \hat{r}_p] \qquad (25a)$

$\phi_{a1}[p, r_a] = (P \mid M)_\epsilon[p,n] \cdot U[n, r_a] \qquad (25b)$

Since both ϕ_(p) and ϕ_(a1) are product-based matrices, they can be concatenated to obtain the full matrix of product features:

ϕ[p,r]=ϕ_(p)[p,r _(p)]∪ϕ_(a1)[p,r _(a)], where r=r _(p) +r _(a)  (26)

In order to apply signal weights to the full feature matrix of dimension r, the weight vectors must also be extended by concatenating with zero vectors:

$W_p[r,1] = \hat{W}_p[\hat{r}_p, 1] \cup 0[r_a, 1] \qquad (27a)$

$W_a[r,1] = 0[r_p, 1] \cup \hat{W}_a[\hat{r}_a, 1] \qquad (27b)$

In the new notation, the following is obtained:

S[p,1]=(ϕ[p,r]·W _(p)[r,1])×(ϕ[p,r]·W _(a)[r,1])  (28)

S[p, 1] is therefore a quadratic function with unknown weights W_(p) and W_(a). S[p, 1] computes the scores of products, and ϕ[p, r] represents product features.

A non-linear scoring model may be learned by incorporating it into a ranking model trained on product features ϕ:

S(ϕ)[p,1]≈nonlinear_model(ϕ[p,r])  (29)

Referring back to the example described above, using the sample data used in the example, feature vectors ϕ[p=10, r=2+3=5] for all products may be calculated as follows:

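| Product ID | ϕ(i, 1) | ϕ(i, 2) | ϕ(i, 3) | ϕ(i, 4) | ϕ(i, 5) |
| --- | --- | --- | --- | --- | --- |
| 1 | 0 | 0 | 0.375 | 0.4089 | 0.0667 |
| 2 | 0 | 0.6667 | 0.125 | 0.1284 | 4 × 10⁻¹² |
| 3 | 0 | 0 | 0.375 | 0.4242 | 0.0667 |
| 4 | 0 | 0 | 0.125 | 0.1284 | 4 × 10⁻¹² |
| 5 | 0.3333 | 0 | 0.125 | 0.2756 | 0.0667 |
| 6 | 0.2667 | 0 | 0.125 | 0.2909 | 0.0667 |
| 7 | 0.2 | 0 | 0.375 | 0.2724 | 0.2667 |
| 8 | 0.1333 | 0.3333 | 0.125 | 0.2006 | 0.2667 |
| 9 | 0.0667 | 0 | 0.125 | 0.2534 | 0.0667 |
| 10 | 0 | 0 | 0.125 | 0.1065 | 0.6667 |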
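Formula (28) can be checked against two rows of the feature table above. The sketch below assumes the normalized preset weights from the example (all weights equal to 1.0 before normalization); variable names are illustrative only.

```python
import numpy as np

# Rows of the example feature table for products 2 and 8: [phi_p | phi_a1].
phi = np.array([[0.0, 0.6667, 0.125, 0.1284, 4e-12],
                [0.1333, 0.3333, 0.125, 0.2006, 0.2667]])
r_p, r_a = 2, 3

# Normalized weights, zero-extended to length r = r_p + r_a per (27a) and (27b).
W_p = np.concatenate([np.full(r_p, 1.0 / r_p), np.zeros(r_a)])
W_a = np.concatenate([np.zeros(r_p), np.full(r_a, 1.0 / r_a)])

# Formula (28): element-wise product of the two weighted feature sums.
S = (phi @ W_p) * (phi @ W_a)
```

The two results agree, to rounding, with the scores 0.0282 and 0.0461 obtained for products 2 and 8 in the worked example.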

For example, the vectors of five numbers (rows in the above table) may be calculated for each product in the search result set from the recommendation signals collected from several recommendation systems. When a customer clicks on (e.g., adds to bag, purchases, etc.) one of the products from the results page, the five numbers associated with this product may be logged in the clickstream event log. This log may be used as an input into training of the ML model.

Linear Formulation

The formulation S(ϕ)≈nonlinear_model(ϕ), where S are product scores and ϕ are product features, is the generic formulation for a pointwise learning-to-rank problem. Different training methods for such a nonlinear model may be used, such as nonlinear regression, kernel SVMs, decision trees, neural networks, and the like.

The disadvantage of such training methods is that weight vectors W_(a) and W_(p) are not directly computed and, therefore, Attribute Ranking approaches described below cannot be used.

Because product features ϕ[p, r] are not generic product features, but already personalized recommendation signals, the equation for S[p, 1] may be restated as a linear regression problem:

$\Phi[p, \tilde{r}] = \mathrm{transform\_features}(\phi[p, r]) \qquad (30)$

(the number of features $\tilde{r}$ in the linear model may differ from the number of recommendation signals r)

$S(\phi)[p,1] = \Phi[p, \tilde{r}] \cdot W[\tilde{r}, 1] \approx \mathrm{linear\_model}(\Phi[p, \tilde{r}]) \qquad (31)$

Examples of linear models are mean squared error (MSE) regression and support vector machine (SVM) with linear kernel. Linear models are known to be unable to capture complex nuance of feature relationships as well as, for example, decision trees can. However, if recommendation signals are well-selected for relevance, this complex nuance may not be necessary.

Based on valid transformed features, the following pointwise training discipline using a linear model may be used:

1. Sample product view, add-to-bag (ATB), and purchase events p_(t)* across sessions in the training period, for example, the previous day. Training period and sampling strategy are hyperparameters of the model training.

2. Apply heuristic relevance model (HRM) to the sampled events. HRM should assign target relevance to every sampled event:

S _(t)[p _(t)*,1]=estimate_relevance(p _(t)*)>0  (32)

For example: for every product view event i, assign score S_(t)(i)=0.3; for every ATB event, assign score S_(t)(i)=0.7; for every purchase event, assign score S_(t)(i)=1.0.

3. For every session with one or more assigned S_(t)(i)>0, sample those products which were shown above the products with assigned S_(t)(i)>0, but which customer did not view. Assign S_(t)(j)=0 to sampled events. Number of sampled products and sampling strategy are hyperparameters of the model training.

4. Based on the session data from the gathered events, compute product feature matrix ϕ[p_(t)* , r].

5. There may be events where some recommenders did not produce signal greater than ϵ for some products or produced the same signal value for many products. Such events are less valuable for learning the model weights. Thus, the target number of training samples p_(t) is selected and only the best p_(t) events are kept:

ϕ[p _(t) ,r]=remove_irrelevant_rows(ϕ[p _(t) *,r])  (33)

6. Scores S_(t)(i) assigned to every event which passed the filter define target scores vector S_(t)[p_(t), 1].

7. Transform features to make them compatible with linear models (the number of features $\tilde{r}$ in a linear model may differ from the number of recommendation signals r):

$\Phi[p_t, \tilde{r}] = \mathrm{transform\_features}(\phi[p_t, r]) \qquad (34)$

8. Use a linear model to directly estimate the weights $W[\tilde{r}, 1]$.

For example, the normal equation for linear regression with regularization term α allows the weights to be computed as follows:

$W[\tilde{r},1] = (\Phi^T[\tilde{r}, p_t] \cdot \Phi[p_t, \tilde{r}] + \alpha I[\tilde{r}, \tilde{r}])^{-1} \cdot \Phi^T[\tilde{r}, p_t] \cdot S_t[p_t, 1] \qquad (35)$

SVM with a linear kernel also allows for directly computing the weights using quadratic programming methods.

9. Use the computed weights in product scoring until the next time the weights are recomputed.

10. Incremental weight updates may be computed via the gradient of a loss function used to train the linear model.

For example, for mean squared error (MSE) regression and learning rate η:

$\begin{matrix} {\nabla MSE_S(W[\tilde{r},1]) = \frac{2}{p_t}\,\Phi^T[\tilde{r}, p_t] \cdot (\Phi[p_t, \tilde{r}] \cdot W[\tilde{r},1] - S_t[p_t,1])} & (36a) \\ {W_{new}[\tilde{r},1] = (1 - \alpha\eta)\,W[\tilde{r},1] - \eta\,\nabla MSE_S[\tilde{r},1]} & (36b) \end{matrix}$

Linear SVM gradients can be similarly applied.

11. For online learning of weights, gradients can be periodically applied between full weight recalculations.

12. Live ranking application first computes product features ϕ[p, r], then applies feature transformation to obtain Φ[p, {tilde over (r)}], then applies computed weights W[{tilde over (r)}, 1] to obtain product scores S[p, 1].
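The incremental update of step 10 and the live scoring of step 12 can be illustrated with a minimal sketch. It assumes that the transformed features Φ are available as a plain list of rows; the function names, the sample data, and the learning rate are hypothetical and are not part of the described system.

```python
def predict(features, weights):
    """Step 12: S[p, 1] = Phi[p, r~] . W[r~, 1]."""
    return [sum(f * w for f, w in zip(row, weights)) for row in features]


def mse_gradient(features, weights, targets):
    """Equation (36a): (2 / p_t) * Phi^T . (Phi . W - S_t)."""
    p_t = len(features)
    residuals = [s_hat - s for s_hat, s in zip(predict(features, weights), targets)]
    return [
        2.0 / p_t * sum(features[i][j] * residuals[i] for i in range(p_t))
        for j in range(len(weights))
    ]


def update_weights(features, weights, targets, eta, alpha):
    """Equation (36b): W_new = (1 - alpha * eta) * W - eta * gradient."""
    gradient = mse_gradient(features, weights, targets)
    return [(1.0 - alpha * eta) * w - eta * g for w, g in zip(weights, gradient)]


# Hypothetical transformed features (3 events, 2 features) and target scores.
PHI = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
S_T = [1.0, 2.0, 3.0]

weights = [0.0, 0.0]
for _ in range(200):  # gradients applied between full recalculations (step 11)
    weights = update_weights(PHI, weights, S_T, eta=0.1, alpha=0.0)

scores = predict(PHI, weights)  # live product scores S[p, 1]
```

With α set to zero the loop approaches the unregularized least-squares weights for this toy data; a non-zero α shrinks the weights at every step.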

Sparse Features Problem

Features ϕ[p_(t), r] are obtained from black-box recommendation signals. The quality of these signals may vary. Some signals may be present only for a few products, and some signals may produce the same value for many products. It is, thus, necessary to determine a way to implement equation (33) from step 5 described above.

If a signal value is negligible (ϵ), or is the same (or similar within tolerance) for a group of products, applying any weight to that signal would not differentiate these products' ranks. Therefore, the weight of such a signal only needs to be computed from events where it is neither negligible nor the same for all products.

Given a raw training set of product events p_(t)*, the task is to identify p_(t)⊆p_(t)*, where features ϕ[p_(t), r] are most informative for signal weights calculation. According to an exemplary embodiment, the approach is to consider the rarity of each signal, and the strength of signals per product. The common approach to score items (in this case product events p_(t)*) with respect to the signal rarity and strength is TF-IDF.

To compute TF-IDF ranks for the product set p_(t)*, it would be necessary to convert matrix ϕ[p_(t)*, r] into a term matrix. The conversion is performed by an element-wise function:

$\begin{matrix} {{\overset{\sim}{\varphi}\left\lbrack {p_{t}^{*},r} \right\rbrack} = {{as\_ term}\left( {\varphi \left\lbrack {p_{t}^{*},r} \right\rbrack} \right)}} & (37) \\ {{{standardize}\left( {\varphi \left( {i,j} \right)} \right)} = \frac{{\varphi \left( {i,j} \right)} - {{mean}\left( {\varphi (j)} \right)}}{1.0 + {{std}\left( {\varphi (j)} \right)}}} & (38) \\ {{\varphi_{std}\left\lbrack {p_{t}^{*},r} \right\rbrack} = {{standardize}\left( {\varphi \left\lbrack {p_{t}^{*},r} \right\rbrack} \right)}} & (39) \\ {{{as\_ term}\left( {\varphi \left( {i,j} \right)} \right)} = \left\{ \begin{matrix} {\epsilon,{{\varphi \left( {i,j} \right)} \leq \epsilon}} \\ {{{round}\left( {{\varphi_{std}\left( {i,j} \right)}*g} \right)},{{\varphi \left( {i,j} \right)} > \epsilon}} \end{matrix} \right.} & (40) \end{matrix}$

where g is the granularity hyperparameter, std is the standard deviation, round is the scalar function rounding to the nearest integer.

According to an exemplary embodiment, g is selected by:

$\begin{matrix} {g = \frac{1}{\min\left( \left| \varphi_{std}(i,j) \right| \,:\, 1 \leq i \leq p_{t}^{*},\ 1 \leq j \leq r,\ \varphi_{std}(i,j) \neq 0 \right)}} & (41) \end{matrix}$

Then, a list of all unique terms in the term matrix is built:

T[1,u]=unique_elements({tilde over (ϕ)}[p _(t) *,r])  (42)

An inverse document frequency (IDF) vector is computed as follows:

$\begin{matrix} {{{{idf}\left( \overset{\sim}{\varphi} \right)}\left\lbrack {1,u} \right\rbrack} = {\overset{u}{\bigcup\limits_{k = 1}}\left( {1 + {\ln \frac{p_{t}^{*}}{1 + {\sum\limits_{i = 1}^{p_{t}^{*}}{\sum\limits_{j = 1}^{r}\left\{ \begin{matrix} {1,} & {{\overset{\sim}{\varphi}\; \left( {i,j} \right)} = {T(k)}} \\ {0,} & {{\overset{\sim}{\varphi}\; \left( {i,j} \right)} \neq {T(k)}} \end{matrix} \right.}}}}} \right)}} & (43) \end{matrix}$

Row-wise TF-IDF scores for every product in p_(t)* are computed by summing the IDF scores of every term in the corresponding row:

$\begin{matrix} {\mathrm{tfidf}(\tilde{\varphi})[p_{t}^{*},1] = \bigcup\limits_{i=1}^{p_{t}^{*}} \left( \sum\limits_{j=1}^{r} \sum\limits_{k=1}^{u} \left\{ \begin{matrix} {\mathrm{idf}(\tilde{\varphi})(k),} & {\tilde{\varphi}(i,j) = T(k)} \\ {0,} & {\tilde{\varphi}(i,j) \neq T(k)} \end{matrix} \right. \right)} & (44) \end{matrix}$

The subset ϕ[p_(t), r]⊆ϕ[p_(t)*, r] can be selected as those rows of ϕ[p_(t)*, r] that have the highest scores in tfidf({tilde over (ϕ)})[p_(t)*, 1].

Thus, the algorithm described in this section may be denoted as:

ϕ[p _(t) ,r]=tfidf_rank(ϕ[p _(t) *,r],g)  (45)

Referring back to the example described above and using its sample data, the term matrix as_term(ϕ)[10,5] is calculated for all products, taking a granularity constant g=100:

| Product ID | {tilde over (ϕ)}(i, 1) | {tilde over (ϕ)}(i, 2) | {tilde over (ϕ)}(i, 3) | {tilde over (ϕ)}(i, 4) | {tilde over (ϕ)}(i, 5) |
|---|---|---|---|---|---|
| mean(ϕ(j)) | 0.1 | 0.1 | 0.2 | 0.2489 | 0.1534 |
| std(ϕ(j)) | 0.1202 | 0.2134 | 0.1146 | 0.1054 | 0.1933 |
| 1 | 10⁻¹² | 10⁻¹² | 16 | 14 | −7 |
| 2 | 10⁻¹² | 47 | −7 | −11 | −13 |
| 3 | 10⁻¹² | 10⁻¹² | 16 | 16 | −7 |
| 4 | 10⁻¹² | 10⁻¹² | −7 | −11 | −13 |
| 5 | 21 | 10⁻¹² | −7 | 3 | −7 |
| 6 | 15 | 10⁻¹² | −7 | 4 | −7 |
| 7 | 9 | 10⁻¹² | 16 | 2 | 9 |
| 8 | 3 | 19 | −7 | −4 | 9 |
| 9 | −3 | 10⁻¹² | −7 | 0 | −7 |
| 10 | 10⁻¹² | 10⁻¹² | −7 | −13 | 43 |

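The IDF vector of unique terms is obtained from the term matrix:

| T(k) | IDF(k) |
|---|---|
| 10⁻¹² | 0.66 |
| −13 | 1.92 |
| −11 | 2.20 |
| −7 | 0.74 |
| −4 | 2.61 |
| −3 | 2.61 |
| 0 | 2.61 |
| 2 | 2.20 |
| 3 | 2.61 |
| 4 | 2.61 |
| 9 | 1.92 |
| 14 | 2.61 |
| 15 | 2.61 |
| 16 | 1.69 |
| 19 | 2.61 |
| 21 | 2.61 |
| 43 | 2.61 |
| 47 | 2.61 |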

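The scores tfidf({tilde over (ϕ)}) are obtained by summing the IDF values of the terms row-wise:

| Product ID | TFIDF({tilde over (ϕ)})(i) | Rank |
|---|---|---|
| 1 | 6.37 | 7 |
| 2 | 8.13 | 3 |
| 3 | 5.45 | 9 |
| 4 | 6.18 | 8 |
| 5 | 6.95 | 5 |
| 6 | 7.36 | 4 |
| 7 | 8.39 | 2 |
| 8 | 10.48 | 1 |
| 9 | 7.36 | 4 |
| 10 | 6.59 | 6 |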

For p_(t)=5, rows 8, 7, 2, 6, and 9 are selected.
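The row selection of equations (42) through (45) can be sketched as follows, applied to the term matrix of this example. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, the constant EPS stands in for the 10⁻¹² term, and ties are resolved by original row order, which is an assumption. Scores recomputed from the term counts may differ in the second decimal place from the rounded values tabulated above.

```python
import math
from collections import Counter

EPS = 1e-12  # stands in for the negligible-signal term

# Term matrix for products 1..10, copied from the example.
TERMS = [
    [EPS, EPS, 16, 14, -7],
    [EPS, 47, -7, -11, -13],
    [EPS, EPS, 16, 16, -7],
    [EPS, EPS, -7, -11, -13],
    [21, EPS, -7, 3, -7],
    [15, EPS, -7, 4, -7],
    [9, EPS, 16, 2, 9],
    [3, 19, -7, -4, 9],
    [-3, EPS, -7, 0, -7],
    [EPS, EPS, -7, -13, 43],
]


def idf_table(terms):
    """Equation (43): idf = 1 + ln(p_t* / (1 + occurrences of the term))."""
    n_rows = len(terms)
    counts = Counter(term for row in terms for term in row)
    return {term: 1.0 + math.log(n_rows / (1.0 + c)) for term, c in counts.items()}


def tfidf_scores(terms):
    """Equation (44): sum the IDF of every term in each row."""
    idf = idf_table(terms)
    return [sum(idf[term] for term in row) for row in terms]


def top_rows(terms, p_t):
    """Equation (45): 1-based indexes of the p_t highest-scoring rows."""
    scores = tfidf_scores(terms)
    order = sorted(range(len(terms)), key=lambda i: scores[i], reverse=True)
    return [i + 1 for i in order[:p_t]]


selected = top_rows(TERMS, 5)
```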

Linear Formulation for Product-Based Pointwise Training

Based on the scoring function being quadratic, a polynomial feature extraction is applied to obtain a linear model:

$\begin{matrix} \begin{matrix} {{s\left\lbrack {p,1} \right\rbrack} = {\bigcup\limits_{k = 1}^{p}\left( {\sum\limits_{i = 1}^{r}\; {{\varphi \left( {k,i} \right)}{W_{p}(i)}*{\sum\limits_{j = 1}^{r}\; {{\varphi \left( {k,j} \right)}{W_{a}(j)}}}}} \right)}} \\ {= {\bigcup\limits_{k = 1}^{p}{\sum\limits_{i = 1}^{r}\; {\sum\limits_{j = 1}^{r}\; {\left( {{\varphi \left( {k,i} \right)}{\varphi \left( {k,j} \right)}} \right)*\left( {{W_{p}(i)}{W_{a}(j)}} \right)}}}}} \\ {= {\bigcup\limits_{k = 1}^{p}{\left( {{\varphi (k)} \otimes {\varphi (k)}} \right) \cdot \left( {W_{p} \otimes W_{a}} \right)}}} \end{matrix} & (46) \end{matrix}$

From this, the feature transformation function is obtained:

$\begin{matrix} {{\Phi \left\lbrack {p,r^{2}} \right\rbrack} = {{{transform\_ features}\left( {\varphi \left\lbrack {p,r} \right\rbrack} \right)} = {\bigcup\limits_{k = 1}^{p}{{\varphi (k)} \otimes {\varphi (k)}}}}} & (47) \end{matrix}$

Accordingly, pointwise training discipline can be applied, as described above. A disadvantage of this formulation is that, even though it allows a linear model to be trained, its weights would be:

W=W _(p) ⊗W _(a)  (48)

The above would not allow for determining W_(p) and W_(a) used in attribute scoring. Thus, two linear models may be built, instead of a single model, in order to access weights W_(p) and W_(a).
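The identity behind equations (46) through (48) can be checked numerically. The sketch below uses hypothetical signals and weights; the helper names are illustrative only. It shows that the product of the two linear scores equals a single dot product between the Kronecker-expanded features and the Kronecker product of the weights.

```python
def kron(u, v):
    """Kronecker product of two vectors: element (i, j) is u[i] * v[j]."""
    return [a * b for a in u for b in v]


def transform_features(phi):
    """Equation (47): each row phi(k) is replaced by phi(k) (x) phi(k)."""
    return [kron(row, row) for row in phi]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical signals for two products (r = 2) and hypothetical weights.
PHI = [[0.2, 0.5], [0.4, 0.1]]
W_P = [0.3, 0.7]
W_A = [0.6, 0.4]

# Quadratic form: (phi . W_p) * (phi . W_a), first line of equation (46).
quadratic = [dot(row, W_P) * dot(row, W_A) for row in PHI]

# Linear form: (phi (x) phi) . (W_p (x) W_a), last line of equation (46).
linear = [dot(row, kron(W_P, W_A)) for row in transform_features(PHI)]
```

The trained linear model only exposes the r² products W_p(i)·W_a(j), which is why the individual factors cannot be read off directly.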

Linear Formulation for Attribute-Based Pointwise Training

Based on (2b), (3a) and (3c), the product scores S[p, 1] may be represented in a different way:

S[p,1]=₂(ℱ_(p)(ϕ_(p)(Q),W _(p))[p,1],ℱ_(a2)(ϕ_(a2)(P,M,R),W _(a))[n,1])  (49)

Given the definitions of ϕ_(p) and ℱ_(p)=B_(ϵ)[p, 1] computed so far, the remaining definitions can be obtained by the following substitutions:

₂(f,g)=f×((P|M)_(ϵ) ·g)

ℱ_(a2)(P,M,R,W _(a))[n,1]=U[n,r _(a)]·Ŵ_(a)[r _(a),1]

ϕ_(a2)[n,r _(a)]=U[n,r _(a)]  (50a-50c)

wherein ₂ is composition via element-wise multiplication of the argument f and the modified argument g.

ℱ_(a2)(ϕ_(a2)) is, therefore, a linear function with unknown weights W_(a), so that the modified pointwise training discipline can be applied to train a linear model:

ℱ_(a2)(ϕ_(a2))≈linear_model(ϕ_(a2))  (51)

In the attribute-based approach, a scoring model is defined to act not on the products but on the product attributes. ℱ_(a2)(ϕ_(a2)) computes the scores of product attributes, and vector ϕ_(a2) represents the input features.

The modification to the pointwise training discipline described above considers product attribute selection events instead of product selection events. Every product view, ATB, and purchase event can be converted into attribute selection events for all the selected product's attribute values. Additionally, facet selection events may be considered. HRM for attribute selection events should assign higher relevance to those attribute values that contributed to a customer's progress through the shopping funnel.

For example: a customer searches for “jeans”; then filters search results on brand “Calvin Klein” and applies additional filters on color “blue” and size “small”; views five products, then adds two of them to the shopping bag, and purchases one.

The attribute HRM should increase the target relevance score of BRAND=“Calvin Klein” based on the following factors:

- It was the facet filter applied in 3 keyword search events in the same session.
- It was the attribute value of five viewed products.
- It was the attribute value of two products added to bag.
- It was the attribute value of a purchased product.

After the ℱ_(a2) approximation model has been trained, its predictions for every possible attribute value can be used to obtain product features for the second linear model. From this, the feature transformation function is obtained:

$\begin{matrix} \begin{matrix} {\Phi[p,r_{p}] = \mathrm{transform\_features}\left( \varphi[p,r] \right)} \\ {= \varphi_{p}[p,r_{p}] \times \left( (P|M)_{\epsilon}[p,n] \cdot \mathcal{F}_{a2}\left( \varphi_{a2}[n,r_{a}] \right)[n,1] \right)} \end{matrix} & (52) \end{matrix}$

Thus, pointwise training discipline may be applied as described above. The first linear model provides the weights W_(a), which are required for attribute ranking, and the second model provides the weights W_(p).

Referring back to the example described above, attribute-based features ϕ_(a2)[n, r_(a)] have already been calculated above (see U[20, 3] above):

| Attribute Value | ϕ_(a2)(1) | ϕ_(a2)(2) | ϕ_(a2)(3) |
|---|---|---|---|
| Shoes | 10⁻¹² | 0.06 | 10⁻¹² |
| Pants | 10⁻¹² | 0.06 | 10⁻¹² |
| Dress | 10⁻¹² | 0.06 | 10⁻¹² |
| Blender | 10⁻¹² | 0.02 | 10⁻¹² |
| Nike | 10⁻¹² | 0.125 | 0.0667 |
| Adidas | 0 | 0 | 0 |
| Billabong | 10⁻¹² | 0.05 | 0.2667 |
| Kitchenaid | 10⁻¹² | 0.025 | 0.6667 |
| Black | 0.125 | 0.0462 | 10⁻¹² |
| White | 0.125 | 0.0615 | 10⁻¹² |
| Red | 0.125 | 0.0462 | 10⁻¹² |
| Blue | 0.125 | 0.0462 | 10⁻¹² |
| Leather | 10⁻¹² | 0.0444 | 10⁻¹² |
| Suede | 10⁻¹² | 0.0222 | 10⁻¹² |
| Denim | 10⁻¹² | 0.0222 | 10⁻¹² |
| Chino | 10⁻¹² | 0.0444 | 10⁻¹² |
| Silk | 10⁻¹² | 0.0444 | 10⁻¹² |
| Nylon | 10⁻¹² | 0.0222 | 10⁻¹² |
| Oxford | 0.25 | 0.1333 | 10⁻¹² |
| Maxi | 0.25 | 0.0667 | 10⁻¹² |

When a customer applies a facet refinement, or views, adds to bag (ATB), or purchases a product, the three numbers associated with the refinement, or with the product's attribute values, may be logged in the clickstream event log. This log may be used as an input into training of the ML model.

Referring back to the example described above, the product features have already been calculated (see ϕ(1) and ϕ(2) above). The additional feature component (P|M)_(ϵ)·ℱ_(a2)(ϕ_(a2)) may be calculated based on the output of the facet value scoring model. For example, a linear approximation of ℱ_(a2)(ϕ_(a2)) has been trained, which learned the following W_(a)[3, 1]=(0.25, 0.25, 0.5):

| Attribute Value | ≈ ℱ_(a2) |
|---|---|
| Shoes | 0.015 |
| Pants | 0.015 |
| Dress | 0.015 |
| Blender | 0.005 |
| Nike | 0.0646 |
| Adidas | 0 |
| Billabong | 0.1459 |
| Kitchenaid | 0.3396 |
| Black | 0.0428 |
| White | 0.0467 |
| Red | 0.0428 |
| Blue | 0.0428 |
| Leather | 0.0111 |
| Suede | 0.0056 |
| Denim | 0.0056 |
| Chino | 0.0111 |
| Silk | 0.0111 |
| Nylon | 0.0056 |
| Oxford | 0.0958 |
| Maxi | 0.0792 |
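The tabulated approximations can be reproduced as the dot product of each attribute value's feature row with the learned weights W_(a). The sketch below uses a subset of the feature rows from the example; the dictionary layout and function name are hypothetical.

```python
EPS = 1e-12  # the negligible signal value used in the example

# Subset of the attribute-value feature rows from the example.
U = {
    "Nike": (EPS, 0.125, 0.0667),
    "Billabong": (EPS, 0.05, 0.2667),
    "Kitchenaid": (EPS, 0.025, 0.6667),
    "Black": (0.125, 0.0462, EPS),
    "Oxford": (0.25, 0.1333, EPS),
}
W_A = (0.25, 0.25, 0.5)  # weights learned in the example


def attribute_value_scores(u_rows, weights):
    """Linear approximation: score = feature row . W_a for every attribute value."""
    return {
        name: sum(x * w for x, w in zip(row, weights))
        for name, row in u_rows.items()
    }


scores = attribute_value_scores(U, W_A)
```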

Finally, the following transformed product features are obtained in this example:

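| Product ID | (P|M)_(ϵ)·ℱ_(a2)(ϕ_(a2)) | Φ(i,1) | Φ(i,2) |
|---|---|---|---|
| 1 | 0.2293 | 0 | 0 |
| 2 | 0.0634 | 0 | 0.0423 |
| 3 | 0.2332 | 0 | 0 |
| 4 | 0.0634 | 0 | 0 |
| 5 | 0.1335 | 0.0445 | 0 |
| 6 | 0.1373 | 0.0366 | 0 |
| 7 | 0.2952 | 0.0590 | 0 |
| 8 | 0.2148 | 0.0286 | 0.0716 |
| 9 | 0.1280 | 0.0085 | 0 |
| 10 | 0.3912 | 0 | 0 |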

When a customer clicks on (e.g., adds to bag, purchases, etc.) one of the products from the results page, the two numbers associated with this product may be logged in the clickstream event log. This log may be used as an input into training of the ML model.

Derivation of Attribute Scoring Function

As shown above in (4), ℱ_(a2) represents the vector of attribute value scores V[n, 1]:

V[n,1]=U[n,r _(a)]·Ŵ_(a)[r _(a),1]  (53)

Given the attribute value scores V[n, 1], the attribute scores can be obtained by summing each block:

(V)=A[a,1]=M[a,n]·V[n,1]  (54)

Referring back to the example described above, the attribute value scores have already been calculated:

| Attribute Value | V = U · Ŵ_(a) |
|---|---|
| Shoes | 0.02 |
| Pants | 0.02 |
| Dress | 0.02 |
| Blender | 0.0067 |
| Nike | 0.0639 |
| Adidas | 0 |
| Billabong | 0.1056 |
| Kitchenaid | 0.2306 |
| Black | 0.0571 |
| White | 0.0622 |
| Red | 0.0571 |
| Blue | 0.0571 |
| Leather | 0.0148 |
| Suede | 0.0074 |
| Denim | 0.0074 |
| Chino | 0.0148 |
| Silk | 0.0148 |
| Nylon | 0.0074 |
| Oxford | 0.1278 |
| Maxi | 0.1056 |

From these attribute value scores, attribute scores may be obtained as follows:

| Attribute | A |
|---|---|
| TYPE | 0.0667 |
| BRAND | 0.4000 |
| COLOR | 0.2334 |
| MATERIAL | 0.0666 |
| STYLE | 0.2333 |

For example, when attributes and attribute values are presented to the customer in selectable facets, the attribute and attribute value ranks may be selected naively based on the scores A and V, so that highest score corresponds to lowest (i.e., most important) rank.
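The block summation of equation (54) and the naive ranking can be sketched as follows. The grouping of attribute values into blocks is inferred from the order of the example tables and should be read as an assumption; the function names are hypothetical. Block sums computed from the rounded values of V may differ from the tabulated A in the fourth decimal place.

```python
# Attribute value scores V from the example, grouped per attribute block.
V_BLOCKS = {
    "TYPE": [0.02, 0.02, 0.02, 0.0067],
    "BRAND": [0.0639, 0.0, 0.1056, 0.2306],
    "COLOR": [0.0571, 0.0622, 0.0571, 0.0571],
    "MATERIAL": [0.0148, 0.0074, 0.0074, 0.0148, 0.0148, 0.0074],
    "STYLE": [0.1278, 0.1056],
}


def attribute_scores(blocks):
    """Equation (54): an attribute score is the sum of its block of V."""
    return {attribute: sum(values) for attribute, values in blocks.items()}


def naive_ranks(scores):
    """Highest score receives rank 1 (the most important rank)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: rank for rank, name in enumerate(ordered, start=1)}


A = attribute_scores(V_BLOCKS)
ranks = naive_ranks(A)
```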

In embodiments, two other ranking strategies may be more useful for guiding the customer to a certain goal: cumulative and discriminative attribute ranking.

Attribute Ranking

Given the product scores S[p, 1], attribute value scores V[n, 1], attribute scores A[a, 1], and attribute metadata M[a, n], the problem of ranking is the problem of finding a hierarchical ranking function (see Appendix below) which would produce a vector of attribute value ranks (or indexes) I[n, 1].

With this notation, the problem of ranking attribute values is the problem of finding the following function ℑ_(S):

ℑ_(S)(V[n,1],M[a,n],ORDER)=I[n,1]  (55)

In accordance with an exemplary embodiment of the present invention, there may be at least two possible approaches to ranking attribute values: cumulative and discriminative.

In the cumulative ranking approach, one may seek the most relevant filter for the entire result set, so that narrowing by this filter leaves overall more relevant products than before applying the filter.

In the discriminative ranking approach, one may seek a filter which would increase the score difference between two given subsets of the search result, e.g., between the top two products.

Cumulative Attribute Ranking

In accordance with an exemplary embodiment of the present invention, a goal may be set to find the most “important” attribute value, removal of which from the computation of product scores reduces the cumulative relevance of the product set by the highest margin. The total score of the product set is defined as the cumulative gain (CG) of the product set. The attribute with the highest contribution to the overall score will reduce the score by the largest number when removed from the calculations, so it will be the attribute which decreases the CG the most when removed from the calculations.

The scores of attribute values are defined by the vector V[n, 1], and their total contribution to product scores may be calculated by element-wise vector multiplication, as shown below in (56):

$\begin{matrix} {{S\left\lbrack {p,1} \right\rbrack} = {{\left( {{\left( {PM} \right)_{\epsilon}\left\lbrack {p,n} \right\rbrack} \cdot {V\left\lbrack {n,1} \right\rbrack}} \right) \times {B_{\epsilon}\left\lbrack {p,1} \right\rbrack}} = {{\left( {{\left( {PM} \right)_{\epsilon}\left\lbrack {p,n} \right\rbrack} \times {B_{\epsilon}\left\lbrack {p,1} \right\rbrack}} \right) \cdot {V\left\lbrack {n,1} \right\rbrack}}\overset{{per}\text{-}{value}}{\Rightarrow}{{V\left\lbrack {n,1} \right\rbrack} \times {\sum\limits_{p}{\left( {{\left( {PM} \right)_{\epsilon}\left\lbrack {p,n} \right\rbrack} \times {B_{\epsilon}\left\lbrack {p,1} \right\rbrack}} \right)^{T}\left\lbrack {n,p} \right\rbrack}}}}}} & (56) \end{matrix}$

Thus, for ranking based on maximum contribution, the vector of attribute value ranks I[n, 1] can be expressed as:

$\begin{matrix} {{I\left\lbrack {n,1} \right\rbrack} = {{hi\_ sort}\left( {{{V\left\lbrack {n,1} \right\rbrack} \times {\sum\limits_{p}{\left( {{\left( {PM} \right)_{\epsilon}\left\lbrack {p,n} \right\rbrack} \times {B_{\epsilon}\left\lbrack {p,1} \right\rbrack}} \right)^{T}\left\lbrack {n,p} \right\rbrack}}},{M\left\lbrack {a,n} \right\rbrack},{DESC}} \right)}} & (57) \end{matrix}$

Based on the example described above, the following table illustrates exemplary comparison between naïve and cumulative attribute value ranks:

| Attribute Value | V | Σ_(p)((P|M)_(ϵ) × B_(ϵ))^(T) | V × Σ_(p)((P|M)_(ϵ) × B_(ϵ))^(T) | Naïve In-Block Rank | Cumulative In-Block Rank |
|---|---|---|---|---|---|
| Shoes | 0.02 | 0.3333 | 0.0067 | 1 | 2 |
| Pants | 0.02 | 0.3 | 0.006 | 1 | 3 |
| Dress | 0.02 | 0.3666 | 0.0073 | 1 | 1 |
| Blender | 0.0067 | 2*10⁻¹² | 0.0133*10⁻¹² | 2 | 4 |
| Nike | 0.0639 | 0.3333 | 0.0213 | 3 | 2 |
| Adidas | 0 | 0.3333 | 0 | 4 | 4 |
| Billabong | 0.1056 | 0.3333 | 0.0352 | 2 | 1 |
| Kitchenaid | 0.2306 | 2*10⁻¹² | 0.4611*10⁻¹² | 1 | 3 |
| Black | 0.0571 | 0.4 | 0.0228 | 2 | 1 |
| White | 0.0622 | 0.1666 | 0.0104 | 1 | 3 |
| Red | 0.0571 | 0.3666 | 0.0209 | 2 | 2 |
| Blue | 0.0571 | 0.0666 | 0.0038 | 2 | 4 |
| Leather | 0.0148 | 3*10⁻¹² | 0.0444*10⁻¹² | 1 | 5 |
| Suede | 0.0074 | 0.3333 | 0.0025 | 2 | 3 |
| Denim | 0.0074 | 2*10⁻¹² | 0.0148*10⁻¹² | 2 | 6 |
| Chino | 0.0148 | 0.3 | 0.0044 | 1 | 2 |
| Silk | 0.0148 | 0.3333 | 0.0049 | 1 | 1 |
| Nylon | 0.0074 | 0.0333 | 0.0002 | 2 | 4 |
| Oxford | 0.1278 | 3*10⁻¹² | 0.3833*10⁻¹² | 1 | 2 |
| Maxi | 0.1056 | 0.1 | 0.0106 | 2 | 1 |

In addition, based on the example described above, the following table illustrates exemplary comparison between naïve and cumulative attribute ranks:

| Attribute | A | Block sum of V × Σ_(p)((P|M)_(ϵ) × B_(ϵ))^(T) | Naïve Rank | Cumulative Rank |
|---|---|---|---|---|
| TYPE | 0.0667 | 0.02 | 4 | 3 |
| BRAND | 0.4000 | 0.0565 | 1 | 2 |
| COLOR | 0.2334 | 0.058 | 2 | 1 |
| MATERIAL | 0.0666 | 0.0121 | 5 | 4 |
| STYLE | 0.2333 | 0.0106 | 3 | 5 |
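The difference between naive and cumulative in-block ranks can be illustrated for the BRAND block of the example. In the sketch below, the name COVERAGE is a hypothetical label for the column Σ_(p)((P|M)_(ϵ) × B_(ϵ))^(T), and the dense ranking helper (equal scores share a rank) is an assumption chosen because it matches the in-block ranks shown in the tables.

```python
def cumulative_contributions(values, coverage):
    """Element-wise product V x sum_p((P|M)_eps x B_eps)^T from equation (56)."""
    return [v * c for v, c in zip(values, coverage)]


def dense_ranks(scores):
    """Rank 1 for the highest score; equal scores share the same rank."""
    distinct = sorted(set(scores), reverse=True)
    return [distinct.index(s) + 1 for s in scores]


BRAND = ["Nike", "Adidas", "Billabong", "Kitchenaid"]
V = [0.0639, 0.0, 0.1056, 0.2306]            # attribute value scores
COVERAGE = [0.3333, 0.3333, 0.3333, 2e-12]   # per-value product contribution

naive = dict(zip(BRAND, dense_ranks(V)))
cumulative = dict(zip(BRAND, dense_ranks(cumulative_contributions(V, COVERAGE))))
```

Kitchenaid has the highest raw score but almost no presence in the result set, so it drops from first place in the naive ranking to third in the cumulative ranking.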

Discriminative Attribute Ranking

In accordance with an exemplary embodiment of the present invention, two groups of products p₁ and p₂ are selected from p based on their scores in S[p, 1]:

P ₁[p ₁ ,n]∪P ₂[p ₂ ,n]⊆P[p,n],p ₁ +p ₂ ≤p  (58)

Examples of such two groups of products include, but are not limited to: first and second scoring products; first and last scoring products; top half and bottom half scoring products; top x and bottom x scoring products, etc. With these two groups of products, their group scores can be defined. Since group scores are calculated, there is no need to track scores of individual products in each group. As such, group vectors are defined as shown below in (59a, 59b):

$\begin{matrix} {\begin{matrix} {G_{1}[1,n] = \left( \sum\limits_{p_{1}} \mathcal{F}_{p_{1}}[p_{1},1] \right)[1,1] \times \left( \sum\limits_{p_{1}} (P|M)_{\epsilon_{1}}[p_{1},n] \right)[1,n]} \\ {G_{2}[1,n] = \left( \sum\limits_{p_{2}} \mathcal{F}_{p_{2}}[p_{2},1] \right)[1,1] \times \left( \sum\limits_{p_{2}} (P|M)_{\epsilon_{2}}[p_{2},n] \right)[1,n]} \end{matrix}} & (59a, 59b) \end{matrix}$

The corresponding group scores can be obtained as shown below in (60a, 60b):

S ₁[1,1]=G ₁[1,n]·V[n,1]

S ₂[1,1]=G ₂[1,n]·V[n,1]  (60a, 60b)

The goal of the discriminative ranking is to increase the difference of these scores:

d=S ₁ −S ₂, where S ₁ ≥S ₂,  (61)

In accordance with an exemplary embodiment of the present invention, individual attribute contribution to the score difference is maximized. In order to maximize the score difference:

d=S ₁ −S ₂ =G ₁[1,n]·V[n,1]−G ₂[1,n]·V[n,1]  (62)

the following (63) must be maximized by removing only one term from this sum:

$\begin{matrix} {{\left( {{G_{1}\left\lbrack {1,n} \right\rbrack} - {G_{2}\left\lbrack {1,n} \right\rbrack}} \right) \cdot {V\left\lbrack {n,1} \right\rbrack}} = {{\Delta \; {{G\left\lbrack {1,n} \right\rbrack} \cdot {V\left\lbrack {n,1} \right\rbrack}}} = {\sum\limits_{i = 1}^{n}\; {\left( {{G_{1}(i)} - {G_{2}(i)}} \right)*{V(i)}}}}} & (63) \end{matrix}$

Removing the smallest term (which may be a negative value) will maximize the remaining sum. The expression (G₁−G₂)^(T)[n, 1]×V[n, 1] gives the vector of terms of the above sum.

Because the smallest term is required to produce the highest rank, it is necessary to rank the values in ascending (ASC) order. However, ASC order would give the highest ranks to those attribute values which are not present in any of the products or not recommended by any of the recommenders. It may therefore be necessary to modify the vector of terms of the above sum to meet the following criteria:

(a) Attribute values with zeros in V[n, 1] should be ranked last, as in:

I[n,1]=hi_sort(V[n,1],M[a,n],DESC),  (64)

(b) Given that both G₁[1, n] and G₂[1, n] have no zero elements, the rest of the attribute values with non-zeros in V[n, 1] should be ranked in ASC order of the sum's terms (which may be negative). To convert the sorting order to descending (DESC), ΔG must be normalized so that the elements of 1[n, 1]−ΔĜ^(T)[n, 1] are in range [0,2]:

I[n,1]=hi_sort(V[n,1]×(1[n,1]−ΔĜ^(T))[n,1],M[a,n],DESC)  (65)

The above three strategies may be combined by using the vector ε[n, 1] to facilitate tie-break:

I[n,1]=hi_sort(V[n,1]×(1+ε−ΔĜ^(T))[n,1],M[a,n],DESC)  (66)

To compute discriminative ranks, it may be necessary to first decide on product groupings p₁ and p₂. For example, assume that the goal is to find attribute values which improve the score of the top half of the search results (p₁) relative to the score of the bottom half of the search results (p₂). Referring back to the example described above, the product ranks have been calculated as follows:

| Product ID | Rank |
|---|---|
| 1 | 9 |
| 2 | 3 |
| 3 | 8 |
| 4 | 10 |
| 5 | 4 |
| 6 | 5 |
| 7 | 2 |
| 8 | 1 |
| 9 | 6 |
| 10 | 7 |

Based on these ranks, the top half consists of products p₁={8, 7, 2, 5, 6}, and bottom half consists of products p₂={9, 10, 3, 1, 4}.

Delta group vector can be calculated as follows, using (67):

$\begin{matrix} {{\Delta \; {G\left\lbrack {1,n} \right\rbrack}} = {{{\left( {\sum\limits_{p_{1}}B_{\epsilon}} \right)\left\lbrack {1,1} \right\rbrack} \times {\left( {\sum\limits_{p_{1}}\left( {PM} \right)_{\epsilon}} \right)\left\lbrack {1,n} \right\rbrack}} - {{\left( {\sum\limits_{p_{2}}B_{\epsilon}} \right)\left\lbrack {1,1} \right\rbrack} \times {\left( {\sum\limits_{p_{2}}\left( {PM} \right)_{\epsilon}} \right)\left\lbrack {1,n} \right\rbrack}}}} & (67) \end{matrix}$

| Attribute Value | ΔG^(T) | 1 + ϵ − ΔĜ^(T) | V | V × (1 + ϵ − ΔĜ^(T)) | Naïve In-Block Rank | Cumulative In-Block Rank | Discriminative In-Block Rank |
|---|---|---|---|---|---|---|---|
| Shoes | 0.9 | 0.9553 | 0.02 | 0.0191 | 1 | 2 | 1 |
| Pants | 1.9 | 0.9056 | 0.02 | 0.0181 | 1 | 3 | 2 |
| Dress | 1.9 | 0.9056 | 0.02 | 0.0181 | 1 | 1 | 2 |
| Blender | −0.0333 | 1.0017 | 0.0067 | 0.0067 | 2 | 4 | 3 |
| Nike | 1.8333 | 0.9089 | 0.0639 | 0.0581 | 3 | 2 | 3 |
| Adidas | 0.9333 | 0.9536 | 0 | 0 | 4 | 4 | 4 |
| Billabong | 1.9332 | 0.9040 | 0.1056 | 0.0954 | 2 | 1 | 2 |
| Kitchenaid | −0.0333 | 1.0017 | 0.2306 | 0.2309 | 1 | 3 | 1 |
| Black | 1.9 | 0.9056 | 0.0571 | 0.0517 | 2 | 1 | 4 |
| White | 1.2222 | 0.9393 | 0.0622 | 0.0584 | 1 | 3 | 1 |
| Red | 1.2721 | 0.9368 | 0.0571 | 0.0535 | 2 | 2 | 3 |
| Blue | 0.2722 | 0.9865 | 0.0571 | 0.0563 | 2 | 4 | 2 |
| Leather | −0.0666 | 1.0033 | 0.0148 | 0.0148 | 1 | 5 | 1 |
| Suede | 0.9666 | 0.9520 | 0.0074 | 0.0070 | 2 | 3 | 4 |
| Denim | −0.0333 | 1.0017 | 0.0074 | 0.0074 | 2 | 6 | 3 |
| Chino | 1.9332 | 0.9040 | 0.0148 | 0.0134 | 1 | 2 | 2 |
| Silk | 1.9332 | 0.9040 | 0.0148 | 0.0134 | 1 | 1 | 2 |
| Nylon | −0.0333 | 1.0017 | 0.0074 | 0.0074 | 2 | 4 | 3 |
| Oxford | −0.0666 | 1.0033 | 0.1278 | 0.1282 | 1 | 2 | 1 |
| Maxi | 0.9666 | 0.9520 | 0.1056 | 0.1005 | 2 | 1 | 2 |

The following table illustrates comparison between naïve, cumulative ranks (already computed above) and discriminative attribute ranks computed in this example:

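| Attribute | Block sum of V × (1 + ε − ΔĜ^(T)) | Naïve Rank | Cumulative Rank | Discriminative Rank |
|---|---|---|---|---|
| TYPE | 0.0620 | 4 | 3 | 5 |
| BRAND | 0.3845 | 1 | 2 | 1 |
| COLOR | 0.2198 | 2 | 1 | 3 |
| MATERIAL | 0.0635 | 5 | 4 | 4 |
| STYLE | 0.2287 | 3 | 5 | 2 |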
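The discriminative scores of this example can be approximated with the sketch below. The normalization of ΔG is not spelled out in this passage; dividing by the sum of absolute values is an assumption that happens to reproduce the tabulated column 1 + ϵ − ΔĜ^(T) to about four decimal places. The tie-break vector ε is omitted because it is negligible at this precision, and all function names are hypothetical.

```python
# Delta group vector and attribute value scores, copied from the example table.
DELTA_G = [0.9, 1.9, 1.9, -0.0333, 1.8333, 0.9333, 1.9332, -0.0333,
           1.9, 1.2222, 1.2721, 0.2722, -0.0666, 0.9666, -0.0333,
           1.9332, 1.9332, -0.0333, -0.0666, 0.9666]
V = [0.02, 0.02, 0.02, 0.0067, 0.0639, 0.0, 0.1056, 0.2306,
     0.0571, 0.0622, 0.0571, 0.0571, 0.0148, 0.0074, 0.0074,
     0.0148, 0.0148, 0.0074, 0.1278, 0.1056]


def discriminative_scores(values, delta_g):
    """V x (1 - normalized delta G), the sort key used in equation (65)."""
    norm = sum(abs(d) for d in delta_g)  # assumed L1 normalization
    return [v * (1.0 - d / norm) for v, d in zip(values, delta_g)]


def dense_ranks(scores):
    """Rank 1 for the highest score; equal scores share the same rank."""
    distinct = sorted(set(scores), reverse=True)
    return [distinct.index(s) + 1 for s in scores]


scores = discriminative_scores(V, DELTA_G)
brand_ranks = dense_ranks(scores[4:8])    # Nike, Adidas, Billabong, Kitchenaid
color_ranks = dense_ranks(scores[8:12])   # Black, White, Red, Blue
```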

Hybrid Attribute Ranking

The two ranking functions for the cumulative and discriminative ranking methods can be combined with certain weights to achieve a hybrid ranking. For example, the two ranking functions may be combined by using a hybrid weight vector H[2,1] as shown below in (68):

$\begin{matrix} {I[n,1] = \mathrm{hi\_sort}\left( \left( \left( V \times \sum\limits_{p} \left( (P|M)_{\epsilon} \times B_{\epsilon} \right)^{T} \right)[n,1] \bigcup \left( V \times \left( 1 + \varepsilon - \widehat{\Delta G}^{T} \right) \right)[n,1] \right)[n,2] \cdot H[2,1],\ M[a,n],\ \mathrm{ASC} \right)} & (68) \end{matrix}$

In embodiments, many signals can be used as attribute recommenders R[n, r_(a)]. All attribute recommenders may be divided into two families: personalized and non-personalized. Personalized attribute recommenders may require a Customer Record ID as a mandatory input. Non-personalized recommenders may work generically.

Examples of personalized attribute recommenders may include, but are not limited to:

Attribute Values corresponding to the known properties of the customer, e.g., age, gender, sizes etc.;

Attribute Values of Repeated Prior Purchases;

Attribute Values of products recommended based on the Prior Purchases;

Collaborative Shopping Model in an open source Machine Learning Server, such as PredictionIO; and

Attribute Value (e.g., brand) Propensity models, i.e., machine learning models configured to predict the likelihood of a particular customer purchasing a particular attribute value (e.g., a particular brand) within a certain period of time.

Examples of non-personalized attribute recommenders may include, but are not limited to:

Facet and FacetValue ordering from a Merchandising Rule which is configured to be activated when its Trigger Condition is satisfied.

Facet and Facet Value with the highest product count in the search result.

In accordance with an exemplary embodiment, let's assume that there exists a predetermined association between some anchor attribute's values and some other attribute's recommended values, for example, an association between the values of PRODUCT_TYPE (e.g., “jeans,” “shirts,” “dresses,” etc.) and the facets configured per PRODUCT_TYPE in one or more Merchandising Rules. This association may be used to compute a blended recommendation for the associated attribute values using percentages of the anchor attribute values (e.g., 30% jeans and 70% pants) in the result set.

Term frequency-inverse document frequency (TF-IDF) weighting scheme or other Keyword Search scoring of the search results based on the text index of the Facet Clicks or based on the text index of the Attributes of Purchased Products from the search logs.

Doc2vec (an extension of word2vec neural network models) or other Vector Retrieval scoring of the search results based on the vectorized index of the Facet Clicks or based on the vectorized index of the Attributes of Purchased Products from the search logs.

Confidence scoring of a Classification ML model (e.g., Tensorflow), trained to assign Attribute Values to the search phrase as Classification Labels, based on the Facet Clicks or based on the Attributes of Purchased Products from the search logs.

Keyword search of Facet Values based on matching the keywords of an unsupervised ML topic (e.g., from an LDA model) corresponding to the search phrase against the keywords present in the Facet Values.
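The blended recommendation based on anchor attribute percentages, described above for PRODUCT_TYPE, can be sketched as a weighted sum. The facet names and their per-type weights below are hypothetical; only the 30%/70% split is taken from the example in the text.

```python
def blend_recommendations(anchor_shares, recommendations):
    """Weight each anchor value's recommended facets by its share of the result set."""
    blended = {}
    for anchor, share in anchor_shares.items():
        for facet, weight in recommendations.get(anchor, {}).items():
            blended[facet] = blended.get(facet, 0.0) + share * weight
    return blended


# Share of each PRODUCT_TYPE value in the result set (from the example).
ANCHOR_SHARES = {"jeans": 0.3, "pants": 0.7}

# Hypothetical facet weights configured per PRODUCT_TYPE in Merchandising Rules.
RULE_FACETS = {
    "jeans": {"FIT": 1.0, "WASH": 0.5},
    "pants": {"FIT": 0.5, "MATERIAL": 1.0},
}

signal = blend_recommendations(ANCHOR_SHARES, RULE_FACETS)
```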

In embodiments, if it is not technically feasible to apply the MARLENE scoring algorithm to the entire search result, and reorder all products, the scoring may be limited to the top K results (e.g., typically around a few hundred results). To make sure that top K results are relevant to begin with, some of the attribute recommendations may be applied as pre-filters. Such pre-filters are called “Implicit Filters”, and may, for example, be generated from the unchangeable attributes of the customer known from the user profile data, such as gender, age, and size.

In embodiments, many signals can be used as product recommenders Q[p, r_(p)]. All product recommenders may be divided into two families: personalized and non-personalized. Personalized product recommenders may require a Customer Record as a mandatory input. Non-personalized recommenders may work generically.

Examples of personalized product recommenders may include, but are not limited to:

Repeated Prior Purchases;

Recommendations based on the Prior Purchases; and

Various (image, name, description, etc.) similarity scores based on the Prior Purchases.

Examples of non-personalized product recommenders may include, but are not limited to:

Natural Relevancy Score;

Weighted blend of various business metrics associated with products in the result set;

Machine-learned recommendation score based on various business metrics associated with products in the result set; and

Relevancy Score from a Vector Search Engine, such as a doc2vec index.

In embodiments, a search engine database's API may be used to compose a whitelist of facetable attributes and their values.

In embodiments, depending on the configuration settings, a whitelist of only those attribute values which are present in the facets selected by the fired Merchandising Rule may be required. Alternatively, those facet values may be used as an Attribute-Recommending signal with a heavy weight.

In the context of computing attribute value recommendations, the following use cases may require blacklisting:

Those attributes which correspond to the unsupported use cases in some embodiments, such as PRICE (which requires bucketing);

All attributes which are directly applied as implicit filters;

All attributes which have been detected in the Search Phrase by the search phrase parser; and

All attributes present in the facet filters.
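The whitelisting and blacklisting described above can be sketched as simple set operations. The attribute names and function names are hypothetical; the four blacklist sources correspond to the four use cases listed above.

```python
def build_blacklist(unsupported, implicit_filters, parsed_from_phrase, facet_filters):
    """Union of the four blacklisting use cases."""
    return (set(unsupported) | set(implicit_filters)
            | set(parsed_from_phrase) | set(facet_filters))


def eligible_attributes(whitelist, blacklist):
    """Keep whitelisted attributes that are not blacklisted, preserving order."""
    return [attribute for attribute in whitelist if attribute not in blacklist]


# Hypothetical facetable attributes and request context.
WHITELIST = ["BRAND", "COLOR", "PRICE", "GENDER", "TYPE", "MATERIAL"]
blacklist = build_blacklist(
    unsupported=["PRICE"],          # requires bucketing
    implicit_filters=["GENDER"],    # already applied as an implicit filter
    parsed_from_phrase=["TYPE"],    # detected in the search phrase
    facet_filters=["BRAND"],        # already selected by the customer
)
candidates = eligible_attributes(WHITELIST, blacklist)
```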

The subject matter described herein can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structural means disclosed in this specification and structural equivalents thereof, or in combinations of them. The subject matter described herein can be implemented as one or more computer program products, such as one or more computer programs tangibly embodied in a non-transitory information carrier (e.g., in a machine readable storage device), or embodied in a propagated signal, for execution by, or to control the operation of, data processing apparatus (e.g., a programmable processor, a computer, multiple computers, to name a few). The subject matter described herein can be implemented in one or more local, cloud, and/or hybrid servers and/or one or more local, cloud, and/or hybrid databases. A computer program (also known as a program, software, software application, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file. A program can be stored in a portion of a file that holds other programs or data, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification, including the method steps of the subject matter described herein, can be performed by one or more programmable processors executing one or more computer programs to perform functions of the subject matter described herein by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus of the subject matter described herein can be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit), to name a few.

Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices (e.g., erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and flash memory devices); magnetic disks (e.g., internal hard disks or removable disks); magneto optical disks; and optical disks (e.g., compact discs (CDs) and digital versatile discs (DVDs)). The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

To provide for interaction with a user, the subject matter described herein can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, (e.g., a mouse or a trackball), or a touchscreen, by which the user can provide input to the computer. Other kinds of devices (e.g., a smart phone, a tablet PC, to name a few) can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback, (e.g., visual feedback, auditory feedback, or tactile feedback, to name a few), and input from the user can be received in any form, including acoustic, speech, or tactile input.

The subject matter described herein can be implemented in a computing system that includes a back end component (e.g., a data server), a middleware component (e.g., an application server), or a front end component (e.g., a client computer having a graphical user interface or a web browser through which a user can interact with an implementation of the subject matter described herein), or any combination of such back end, middleware, and front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.

While the invention has been described in conjunction with exemplary embodiments outlined above and illustrated in the drawings, it is evident that the principles of the present invention may be implemented using any number of techniques, whether currently known or not, and many alternatives, modifications and variations in form and detail will be apparent to those skilled in the art. Modifications, additions, or omissions may be made to the systems, apparatuses, and methods described herein without departing from the scope of the present invention. Furthermore, while the foregoing describes a number of separate embodiments of the methodology and tools of the present invention, what has been described herein is merely illustrative of the application of the principles of the present invention. For example, the appearance, the features, the inputs and outputs and the mathematical algorithms of components described herein can be varied to suit a particular application. In addition, each of the various embodiments described above may be combined with other described embodiments in order to provide multiple features. Furthermore, the operations of the systems and apparatuses disclosed herein may be performed by more, fewer, or other components and the methods described may include more, fewer, or other steps. Additionally, steps may be performed in any suitable order.

Accordingly, the exemplary embodiments of the invention, as set forth above, are intended to be illustrative, not limiting, and the spirit and scope of the present invention is to be construed broadly and limited only by the appended claims, and not by the foregoing specification.

To the extent certain functionality or components “can” or “may” be performed or included, respectively, the identified functionality or components are not necessarily required in all embodiments, and can be omitted from certain embodiments of the invention.

To the extent that the foregoing description refers to the “invention” or “present invention,” the present disclosure may include more than one invention.

APPENDIX: MATHEMATICAL TERMS AND NOTATIONS

Matrix M has k rows and n columns.

Dimensions are optionally tracked for each matrix in square brackets following the matrix symbol: M=M[k,n]

Element of a Matrix is denoted with round brackets:

M(i,j)

M[k,n](i,j)

Rows and Columns of a Matrix are denoted with round brackets with a single index; the orientation (row or column) is revealed from context:

M(i)=M(i)[1,n] is a row

M(j)=M(j)[k,1] is a column

Vector is a matrix with one row or one column:

V[1,n] is a row vector

V[k,1] is a column vector

Element of a Vector is denoted with round brackets:

V(i)

V[1,n](i)

V[k,1](i)

M(i,j)=M(i)[1,n](j)=M(j)[k,1](i)

Transposing a matrix or a vector swaps its dimensions:

M[k,n]⇒M ^(T)[n,k]

V[1,n]⇒V ^(T)[n,1]

Scalar Broadcasting fills a matrix with all elements equal to the same number:

0[k,n]

1[k,n]

2[k,n]

3.14[k,n]

etc.

Sum of two or more matrices happens element-wise and requires all dimensions of the matrices to be the same:

A[x,y]+B[x,y]=C[x,y]

For More than Two Matrices:

${M\left\lbrack {k,n} \right\rbrack} = {\sum\limits_{i}{M_{i}\left\lbrack {k,n} \right\rbrack}}$

Multiplication (Hadamard Product) of two or more matrices happens element-wise and requires all dimensions of the matrices to be the same:

A[x,y]×B[x,y]=C[x,y]

For More than Two Matrices:

${M\left\lbrack {k,n} \right\rbrack} = {\underset{i}{\Pi}{M_{i}\left\lbrack {k,n} \right\rbrack}}$

Multiplication of a matrix and a vector happens element-wise across all rows or columns, depending on which dimension is matching:

A[x,y]×B[1,y]=C[x,y]⇒row-wise multiplication

A[x,y]×B[x,1]=C[x,y]⇒column-wise multiplication
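
By way of non-limiting illustration, the row-wise and column-wise multiplication of a matrix and a vector may be sketched in Python as follows. The function name multiply_by_vector and the list-of-lists representation of a matrix are assumptions made here for illustration only and are not part of the notation above.

```python
from typing import List

Matrix = List[List[float]]


def multiply_by_vector(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise product of A[x,y] with a row vector B[1,y] or a column vector B[x,1]."""
    x, y = len(a), len(a[0])
    if len(b) == 1 and len(b[0]) == y:
        # Row vector: every row of A is multiplied element-wise by B.
        return [[a[i][j] * b[0][j] for j in range(y)] for i in range(x)]
    if len(b) == x and len(b[0]) == 1:
        # Column vector: every element of row i of A is multiplied by B(i).
        return [[a[i][j] * b[i][0] for j in range(y)] for i in range(x)]
    raise ValueError("B must be shaped [1,y] or [x,1] to match A[x,y]")


A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
row_wise = multiply_by_vector(A, [[10.0, 100.0, 1000.0]])  # B[1,3]
column_wise = multiply_by_vector(A, [[2.0], [3.0]])        # B[2,1]
```

In this sketch the result always keeps the dimensions of A, which matches C[x,y] in both cases above.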

Multiplication of a matrix and a scalar happens element-wise across all elements of the matrix:

σM[k,n]=σ*M[k,n]=σ[k,1]×M[k,n]=σ[1,n]×M[k,n]=σ[k,n]×M[k,n],

where σ is a scalar

Dot-Product of two vectors is a scalar:

A[1,x]·B[x,1]=σ[1,1]=σ

Matrix Dot-Product is the traditional matrix multiplication and requires inner dimensions to be the same:

A[x,y]·B[y,z]=C[x,z]
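
As a minimal sketch, assuming matrices are stored as Python lists of rows, the dot-product may be written as follows. The function name dot is hypothetical. The dot-product of two vectors is the special case x=z=1, which yields a [1,1] result.

```python
from typing import List

Matrix = List[List[float]]


def dot(a: Matrix, b: Matrix) -> Matrix:
    """Traditional matrix multiplication A[x,y] · B[y,z] = C[x,z]."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must be the same")
    inner = len(b)
    return [
        [sum(a[i][t] * b[t][j] for t in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Dot-product of a row vector A[1,3] and a column vector B[3,1] is a scalar [1,1].
scalar = dot([[1.0, 2.0, 3.0]], [[4.0], [5.0], [6.0]])

# Matrix dot-product A[2,3] · B[3,2] = C[2,2].
C = dot([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
        [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```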

Concatenation of two matrices requires one of their dimensions to be the same:

A[x,z]∪B[y,z]=C[x+y,z]

or

A[x,y]∪B[x,z]=C[x,y+z]

or, for multiple matrices:

${M\left\lbrack {{\sum\limits_{i}k_{i}},n} \right\rbrack} = \left. {\bigcup\limits_{i}{M_{i}\left\lbrack {k_{i},n} \right\rbrack}}\Rightarrow{{stack}\mspace{14mu} {vertically}} \right.$ ${M\left\lbrack {k,{\sum\limits_{i}n_{i}}} \right\rbrack} = \left. {\bigcup\limits_{i}{M_{i}\left\lbrack {k,n_{i}} \right\rbrack}}\Rightarrow{{arrange}\mspace{14mu} {horizontally}} \right.$

Decomposition:

A matrix can be decomposed into its row or column vectors, concatenation of which gives back the original matrix:

${M\left\lbrack {k,n} \right\rbrack} = {\bigcup\limits_{i = 1}^{k}{M_{i}\left\lbrack {1,n} \right\rbrack}}$ ${M\left\lbrack {k,n} \right\rbrack} = {\bigcup\limits_{i = 1}^{n}{M_{i}\left\lbrack {k,1} \right\rbrack}}$

Identity Matrix is a diagonal matrix:

$I[k,n] = \bigcup_{i=1}^{k} \bigcup_{j=1}^{n} \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases}$

Linearized Outer Product of two vectors is a vector obtained by concatenating their element products along the same axis:

${\left( {{A\left\lbrack {1,x} \right\rbrack} \otimes {B\left\lbrack {1,y} \right\rbrack}} \right)\left\lbrack {1,{x*y}} \right\rbrack} = {\overset{x}{\bigcup\limits_{i = 1}}{\overset{y}{\bigcup\limits_{j = 1}}{{A(i)}*{B(j)}}}}$ ${\left( {{A\left\lbrack {x,1} \right\rbrack} \otimes {B\left\lbrack {y,1} \right\rbrack}} \right)\left\lbrack {{x*y},1} \right\rbrack} = {\overset{y}{\bigcup\limits_{j = 1}}{\overset{x}{\bigcup\limits_{i = 1}}{{A(i)}*{B(j)}}}}$

Left Join of matrix A to matrix B on a column j consists of creating a new matrix A×B with all columns from A and from B, where columns borrowed from B in those rows where A(i,j)=B(i,j) have the values from B, and zero in the rest of the rows:

${{A\left\lbrack {a,b} \right\rbrack}\mspace{14mu} \mspace{11mu} {B\left\lbrack {c,d} \right\rbrack}} = {{C\left\lbrack {a,{b + d - 1}} \right\rbrack}\text{:}\mspace{14mu} \left\{ \begin{matrix} {{{C\left( {i,{b + \ldots}}\mspace{14mu} \right)} = {{{B\left( {i,x} \right)}{\forall{i\text{:}\mspace{14mu} {\exists{{B\left( {i,j} \right)}\text{:}\mspace{14mu} {A\left( {i,j} \right)}}}}}} = {B\left( {i,j} \right)}}},{x \neq j}} \\ {{C\left( {i,{b + \ldots}}\mspace{14mu} \right)} = {{0.0{\forall{i\text{:}\mspace{14mu} {\nexists{{B\left( {i,j} \right)}\text{:}\mspace{14mu} {A\left( {i,j} \right)}}}}}} = {B\left( {i,j} \right)}}} \end{matrix} \right.}$

Element-wise application of a function consists of applying a scalar-valued function to every element of a matrix:

scalar_function(A)[x,y]=B[x,y]

Row-wise and column-wise application of a function consists of decomposition of a matrix into row or column vectors followed by application of a function to each row or column to obtain a resulting vector followed by concatenation:

${M\left\lbrack {k,x} \right\rbrack} = {\overset{k}{\bigcup\limits_{i = 1}}{{vector\_ function}\mspace{11mu} {\left( {M_{i}\left\lbrack {1,n} \right\rbrack} \right)\left\lbrack {1,x} \right\rbrack}}}$ ${M\left\lbrack {h,n} \right\rbrack} = {\overset{n}{\bigcup\limits_{i = 1}}{{vector\_ function}\mspace{11mu} {\left( {M_{i}\left\lbrack {k,1} \right\rbrack} \right)\left\lbrack {y,1} \right\rbrack}}}$

Folding of a matrix is a row-wise or column-wise sum:

${M\left\lbrack {1,n} \right\rbrack} = {\sum\limits_{k}{M\left\lbrack {k,n} \right\rbrack}}$ ${M\left\lbrack {k,1} \right\rbrack} = {\sum\limits_{n}{M\left\lbrack {k,n} \right\rbrack}}$

Ranking Function of a vector V[k,1] is a mapping function ℑ(V[k,1],ORDER) into a vector of positions (also known as indices) I[k, 1], such that Ṽ[k, 1] represents the original vector V[k, 1] with elements reordered in accordance with the ORDER specification (e.g., ascending (ASC), descending (DESC), etc.):

${\overset{\sim}{V}\left\lbrack {k,1} \right\rbrack} = {\overset{k}{\bigcup\limits_{i = 1}}{V\left( {I(i)} \right)}}$

An example of Ranking Function is simple indexed_sort, where I[k,1] is the indices of elements of V[k, 1] in the order of magnitude.
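
By way of non-limiting illustration, indexed_sort may be sketched in Python as follows. The sketch assumes zero-based positions (the notation above counts from 1) and assumes that ORDER is passed as the string "ASC" or "DESC". Elements of equal magnitude keep their original relative order, which is a choice made here and not something the definition requires.

```python
from typing import List


def indexed_sort(v: List[float], order: str = "DESC") -> List[int]:
    """Return the positions I of the elements of V in the requested order."""
    return sorted(range(len(v)), key=lambda i: v[i], reverse=(order == "DESC"))


V = [0.2, 0.9, 0.5]
I_desc = indexed_sort(V, "DESC")
I_asc = indexed_sort(V, "ASC")

# The reordered vector is obtained by reading V at the returned positions.
V_reordered = [V[i] for i in I_desc]
```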

Hierarchical Ranking Function of a vector V[k, 1] is a special case of a Ranking Function with an additional parameter M[b, k], a matrix of one-hot encoded block membership:

ℑ(V[k,1],M[b,k],ORDER)=I[k,1]

Using V[k, 1] and M[b, k], it is possible to define two additional parameters:

a vector B[b, 1]=M[b, k]·V[k, 1] of block scores, and

a vector M′[k, 1]=∪_(i=1) ^(b) index_of_nozero(M_(i) ^(T)[k,1]) of block assignments. Each block assignment assigns a position 1 . . . k within V[k, 1] to a block 1 . . . b.

The hierarchical aspect of the ranking function takes into account two scores for each position i∈[1, k]: the original score V(i), and the block score B(M′(i)) for the block M′(i) assigned to the position i in the block assignment vector.

An example of Hierarchical Ranking Function is simple hierarchical_indexed_sort (hi_sort), where I[k,1] is the indices of elements of V[k, 1] in the order of magnitude of their block scores tie-broken by the order of magnitude of their element scores.
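
A minimal sketch of hi_sort is given below. It assumes zero-based positions, assumes that every position belongs to exactly one block, and assumes that ORDER is passed as "ASC" or "DESC". Breaking ties between blocks of equal block score by block index is an additional assumption made here so that the elements of a block stay adjacent; the definition above does not specify it.

```python
from typing import List


def hi_sort(v: List[float], m: List[List[float]], order: str = "DESC") -> List[int]:
    """Order positions of V by block score first, then by element score within a block."""
    b, k = len(m), len(v)
    # Block scores B[b,1] = M[b,k] · V[k,1].
    block_scores = [sum(m[r][i] * v[i] for i in range(k)) for r in range(b)]
    # Block assignments M'[k,1]: the block whose one-hot row is non-zero at position i.
    block_of = [next(r for r in range(b) if m[r][i] != 0) for i in range(k)]
    # Negating the scores turns Python's ascending sort into a descending one.
    sign = -1.0 if order == "DESC" else 1.0
    return sorted(
        range(k),
        key=lambda i: (sign * block_scores[block_of[i]], block_of[i], sign * v[i]),
    )


V = [0.1, 0.4, 0.5, 0.2]
M = [[1, 1, 0, 0],   # block 0 holds positions 0 and 1
     [0, 0, 1, 1]]   # block 1 holds positions 2 and 3
I_desc = hi_sort(V, M, "DESC")
I_asc = hi_sort(V, M, "ASC")
```

In this example block 1 has the higher block score, so under DESC its positions come first, each block being ordered internally by element score.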

Block-Average Encoding:

Consider one-hot encoded vector V[k, 1]. One-hot encoding means that, if this vector describes an instance, then:

${V(i)} = \left\{ \begin{matrix} {1.0,{{if}\mspace{14mu} {the}\mspace{14mu} {instance}\mspace{14mu} {has}\mspace{14mu} {property}\mspace{14mu} i}} \\ {0.0,{{if}\mspace{14mu} {the}\mspace{14mu} {instance}\mspace{14mu} {does}\mspace{14mu} {not}\mspace{14mu} {have}\mspace{14mu} {property}\mspace{14mu} i}} \end{matrix} \right.$

Considering the vector of block assignments M′[k, 1] defined above, block-average encoding for vector V[k, 1] can be obtained from one-hot encoding by dividing each element of V[k, 1] by the sum of all elements of V[k, 1] belonging to the same group:

${\left( {VM} \right)\left\lbrack {k,1} \right\rbrack} = {\overset{k}{\bigcup\limits_{i = 1}}\frac{V(i)}{\sum\limits_{{j\text{:}\mspace{20mu} {M^{\prime}{(j)}}} = {M^{\prime}{(i)}}}^{\;}{V(j)}}}$

Block-average-encoded vectors and matrices are marked herein with a pipe symbol followed by the corresponding one-hot block assignment matrix name.
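
For illustration only, block-average encoding may be sketched as follows. The function name block_average_encode is hypothetical, and the block assignments M′ are assumed to be given as a list of zero-based block indices. Returning 0.0 for a block whose elements sum to zero is an assumption made here, since the formula is undefined in that case.

```python
from typing import Dict, List


def block_average_encode(v: List[float], block_of: List[int]) -> List[float]:
    """Divide each element of V by the sum of the elements of V in the same block."""
    totals: Dict[int, float] = {}
    for value, block in zip(v, block_of):
        totals[block] = totals.get(block, 0.0) + value
    return [
        value / totals[block] if totals[block] != 0 else 0.0
        for value, block in zip(v, block_of)
    ]


# One-hot vector over five attribute values; the first three belong to block 0,
# the last two to block 1.
V = [1.0, 1.0, 0.0, 1.0, 0.0]
M_prime = [0, 0, 0, 1, 1]
encoded = block_average_encode(V, M_prime)
```

In this example the two properties set in block 0 each receive 0.5, and the single property set in block 1 receives 1.0, so the encoded values of every non-empty block sum to 1.0.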

Vector Normalization consists of transforming a vector into another vector with length 1.0 with respect to some norm (i.e., a measure of vector length). A norm is a function mapping a vector into a scalar:

∥V[k,1]∥=N[1,1]

Normalized vectors are marked with a hat symbol:

V̂[k,1]⇒∥V̂∥=1.0

Vector normalization usually preserves the direction of a vector. L1- and L2-norms are used most often and they do preserve the direction of a vector. To standardize normalization algorithms, any L-norm can be used. In general, L1-norm (Manhattan distance), or L2-norm (Euclidean distance) may be sufficient.

For purposes of at least one embodiment of the present invention, it may not be necessary to preserve the direction of a vector, but normalization may be required to maintain the relative order of the vector's elements with respect to their magnitude (a weaker constraint):

indexed_sort(V)=indexed_sort(V̂)

So a “relaxed” definition of a normalized vector can be provided as follows: “a normalized vector is a vector with a given L-norm equal to 1.0, such that the relative magnitudes of vector elements correspond to the relative magnitudes of the original vector elements.” Such normalized vectors are marked with a hat symbol:

$\hat{V}[k,1] \Rightarrow \begin{cases} \|\hat{V}[k,1]\|_{L} = 1.0 \\ \text{indexed\_sort}(V) = \text{indexed\_sort}(\hat{V}) \end{cases}$

Many functions may fit this definition, such as

${\hat{V}\left\lbrack {k,1} \right\rbrack} = {\frac{V\left\lbrack {k,1} \right\rbrack}{\sum\limits_{i = 1}^{k}{{abs}\left( {V(i)} \right)}}\mspace{14mu} \left( {{only}\mspace{14mu} {if}\mspace{14mu} L\; 1\text{-}{norm}\mspace{14mu} {is}\mspace{14mu} {selected}} \right)}$ ${\hat{V}\left\lbrack {k,1} \right\rbrack} = {\frac{V\left\lbrack {k,1} \right\rbrack}{\sqrt[2]{\sum\limits_{i = 1}^{k}\left( {V(i)} \right)^{2}}}\mspace{14mu} \left( {{only}\mspace{14mu} {if}\mspace{14mu} L\; 2\text{-}{norm}\mspace{14mu} {is}\mspace{14mu} {selected}} \right)}$ ${\hat{V}\left\lbrack {k,1} \right\rbrack} = {\overset{k}{\bigcup\limits_{i = 1}}{\frac{e^{V{(i)}}}{\sum\limits_{j = 1}^{k}e^{V{(j)}}}\mspace{14mu} \left( {{{i.e.\mspace{14mu} {sigmoid}}\mspace{14mu} {aka}\mspace{14mu} {softmax}},{{only}\mspace{14mu} {if}\mspace{14mu} L\; 1\text{-}{norm}\mspace{14mu} {is}\mspace{14mu} {selected}}} \right)}}$

Matrix Normalization consists of normalizing rows or columns of a matrix. To distinguish row versus column normalization, the normalized dimension is marked with a hat symbol as well:

$\hat{A}[\hat{x},y] = \bigcup_{i=1}^{x} \hat{A}_{i}[1,y]$ ⇒ L-norm of each row is 1.0

$\hat{A}[x,\hat{y}] = \bigcup_{i=1}^{y} \hat{A}_{i}[x,1]$ ⇒ L-norm of each column is 1.0
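
As a minimal sketch, row and column normalization of a matrix may be written as follows. The sketch assumes the L1-norm is selected, assumes that no row or column consists only of zeros, and uses hypothetical function names.

```python
from typing import List

Matrix = List[List[float]]


def normalize_rows_l1(a: Matrix) -> Matrix:
    """Scale each row of A[x,y] so that the L1-norm of the row is 1.0."""
    result = []
    for row in a:
        total = sum(abs(value) for value in row)
        result.append([value / total for value in row])
    return result


def normalize_columns_l1(a: Matrix) -> Matrix:
    """Scale each column of A[x,y] so that the L1-norm of the column is 1.0."""
    x, y = len(a), len(a[0])
    totals = [sum(abs(a[i][j]) for i in range(x)) for j in range(y)]
    return [[a[i][j] / totals[j] for j in range(y)] for i in range(x)]


A = [[1.0, 3.0],
     [2.0, 2.0]]
by_rows = normalize_rows_l1(A)
by_columns = normalize_columns_l1(A)
```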

What is claimed is:
 1. A computer implemented method comprising the steps of: (a) receiving, at a computer system comprising one or more computers, a search query containing a search phrase from a user device; (b) executing, by the computer system, a query on a search index database to identify matches in response to the search query; (c) computing, by the computer system, facet counts associated with the matches; (d) ranking, by the computer system, the matches; (e) selecting, by the computer system, top K of the matches, wherein K is a predetermined number; (f) receiving, at the computer system from a plurality of recommendation systems, a plurality of recommendation signals generated by the plurality of recommendation systems based at least in part on the search phrase, a customer record, the top K of the matches, and the facet counts; (g) calculating, by the computer system, recommendation scores based at least in part on a combination of the plurality of recommendation signals from the plurality of recommendation systems in accordance with a scoring function S[p, 1]: S[p,1]=B _(∈)[p,1]×((P|M)_(∈)[p,n]·U[n,r _(a)]·

[

1]), wherein B[p,1]={circumflex over (Q)}[p,

]·

[

,1] is a multiplicative boost vector, U[n,r_(a)]=((R

)|M)_(∈)[n,

]×L[n,1] is an adjusted attribute value recommendation matrix, L is whitelisting and blacklisting of attributes and attribute values, P is a product matrix P[p, n] of p products, each product having a total of n possible one-hot encoded attribute values, M is a matrix M[a, n] of one-hot encoded attribute-value to attribute assignments, Q is a first recommendation signal Q[q, r_(p)] from r_(p) product recommenders, R is a second recommendation signal R [n, r_(a)] from r_(a) attribute- and attribute-value recommenders, W_(p) is a vector W_(p)[r_(p), 1] of product recommender weights, W_(a) is a vector W_(a)[r_(a), 1] of attribute- and attribute-value recommender weights, and the recommendation signals comprise the first recommendation signal Q[q, r_(p)] and the second recommendation signal R [n, r_(a)]; (h) determining, by the computer system, an order of the top K of the matches based at least in part on the recommendation scores; (i) determining, by the computer system, an order of facets and facet values based at least in part on the recommendation scores; (j) generating, by the computer system, a search result comprising the ordered top K of the matches and the ordered facets and facet values; and (k) providing, by the computer system, the search result to the user device.
 2. The computer implemented method of claim 1, further comprising, after the step (g) and before the step (h), the step of selecting at most k from the top K of the matches that contribute at least s % of a total of the recommendation scores, wherein the s % is a predetermined percentage of the total of the recommendation scores and k≤K.
 3. The computer implemented method of claim 1, further comprising, after the step (g) and before the step (i), the step of selecting facet values that contribute at least a predetermined percentage of the total of the recommendation scores per facet.
 4. The computer implemented method of claim 1, wherein the step (g) of calculating recommendation scores comprises approximating, by the computer system using a machine learning model, a quadratic function S[p, 1]=(ϕ[p, r]·W_(p)[r, 1])×(ϕ[p, r]·W_(a)[r, 1]) to learn the product recommender weights W_(p) and the attribute- and attribute-value recommender weights W_(a) based at least in part on key-performance-indicator driving feedback signals, wherein: ϕ_(p)[p,r _(p)]={circumflex over (Q)} _(∈)[p,

]; ϕ_(a1)[p,r _(a)]=(P|M)_(∈)[p,n]·U[n,r _(a)]; ϕ[p,r]=ϕ_(p)[p,r _(p)]∪ϕ_(a1)[p,r _(a)], where r=r _(p) +r _(a); and a log associated with a matrix of product features ϕ is used as an input into training of the machine learning model.
 5. The computer implemented method of claim 1, wherein the step (g) of calculating recommendation scores comprises directly approximating, by the computer system using a first machine learning model, product feature weights W[{tilde over (r)}, 1] wherein: S(ϕ)[p,1]=Φ[p,{tilde over (r)}]·W[{tilde over (r)},1]; W[{tilde over (r)},1]=(Φ^(T)[{tilde over (r)},p _(t)]·Φ[p _(t) ,{tilde over (r)}]+αI[{tilde over (r)},{tilde over (r)}])⁻¹·Φ^(T)[{tilde over (r)},p _(t)]·S[p _(t),1]; {tilde over (r)} is a number of transformed features from r input features; and transformed features Φ[p, {tilde over (r)}] of a log associated with ϕ[p, r] are used as input into training of the first machine learning model.
 6. The computer implemented method of claim 5, wherein the transformed features Φ[p, {tilde over (r)}] are obtained by: $\overset{p}{\bigcup\limits_{k = 1}}{{\varphi (k)} \otimes {{\varphi (k)}.}}$
 7. The computer implemented method of claim 5, wherein the first machine learning model comprises a pointwise training discipline using a linear model.
 8. The computer implemented method of claim 1, wherein the step (g) of calculating recommendation scores comprises approximating, by the computer system using a first machine learning model, a linear function

_(a2) to learn attribute-and-attribute value weights W_(a) wherein:

_(a2)(P,M,R,W _(a))[n,1]=U[n,r _(a)]·

[

,1] ϕ_(a2)[n,r _(a)]=U[n,r _(a)]; and a log associated with ϕ_(a2) is used as an input into training of the first machine learning model.
 9. The computer implemented method of claim 8, wherein the first machine learning model comprises a pointwise training discipline using a linear model.
 10. The computer implemented method of claim 8, wherein the step (g) of calculating recommendation scores further comprises approximating, by the computer system using a second machine learning model, a feature transformation function Φ[p,r _(p)]=ϕ_(p)[p,r _(p)]×((P|M)_(∈)[p,n]·

_(a2)(ϕ_(a2)[p,r _(a)])[n,1]) to learn the product recommender weights W_(p), wherein: S[p,1]=Φ[p,r _(p)]·W _(p)[r _(p),1]; and ϕ_(p)[p,r _(p)]={circumflex over (Q)} _(∈)[p,

].
 11. The computer implemented method of claim 10, wherein the second machine learning model comprises a pointwise training discipline using a linear model.
 12. The computer implemented method of claim 1, wherein the step (g) of calculating recommendation scores comprises computing, by the computer system, a vector of attribute value scores V[n, 1] in accordance with: V[n,1]=U[n,r _(a)]·

[

,1].
 13. The computer implemented method of claim 12, wherein the step (g) of calculating recommendation scores further comprises computing, by the computer system, attribute scores

(V) by summing each block in accordance with:

(V)=A[a,1]=M[a,n]·V[n,1].
 14. The computer implemented method of claim 12, wherein the step (i) of determining an order of facets and facet values comprises generating, by the computer system, a vector of cumulative attribute value ranks I[n, 1] in accordance with: ${I\left\lbrack {n,1} \right\rbrack} = {{hi\_ sort}\; {\left( {{{V\left\lbrack {n,1} \right\rbrack} \times {\sum\limits_{p}{\left( {{\left( {PM} \right)_{\epsilon}\left\lbrack {p,n} \right\rbrack} \times {B_{\epsilon}\left\lbrack {p,1} \right\rbrack}} \right)^{T}\left\lbrack {n,p} \right\rbrack}}},{M\left\lbrack {a,n} \right\rbrack},{DESC}} \right).}}$
 15. The computer implemented method of claim 12, wherein the step (i) of determining an order of facets and facet values comprises generating, by the computer system, a vector of discriminative attribute value ranks I[n, 1] in accordance with: I[n,1]=hi_sort(V[n,1]×(1+ε−

^(T))[n,1],M[a,n],DESC), wherein ε[n, 1] is a vector having a very small value for each element to facilitate a tie-break, ${{\left( {{G_{1}\left\lbrack {1,n} \right\rbrack} - {G_{2}\left\lbrack {1,n} \right\rbrack}} \right) \cdot {V\left\lbrack {n,1} \right\rbrack}} = {{\Delta \; {{G\left\lbrack {1,n} \right\rbrack} \cdot {V\left\lbrack {n,1} \right\rbrack}}} = {\sum\limits_{i = 1}^{n}{\left( {{G_{1}(i)} - {G_{2}(i)}} \right)*{V(i)}}}}},\mspace{79mu} {{G_{1}\left\lbrack {1,n} \right\rbrack} = {{\left( {\sum\limits_{p_{1}}{\mathcal{F}_{p_{1}}\left\lbrack {p_{1},1} \right\rbrack}} \right)\left\lbrack {1,1} \right\rbrack} \times {\left( {\sum\limits_{p_{1}}{\left( {PM} \right)_{\epsilon_{1}}\left\lbrack {p_{1},n} \right\rbrack}} \right)\left\lbrack {1,n} \right\rbrack}}}$ $\mspace{79mu} {{G_{2}\left\lbrack {1,n} \right\rbrack} = {{\left( {\sum\limits_{p_{2}}{\mathcal{F}_{p_{2}}\left\lbrack {p_{2},1} \right\rbrack}} \right)\left\lbrack {1,1} \right\rbrack} \times {\left( {\sum\limits_{p_{2}}{\left( {PM} \right)_{\epsilon_{2}}\left\lbrack {p_{2},n} \right\rbrack}} \right)\left\lbrack {1,n} \right\rbrack}}}$ ${\Delta \; {G\left\lbrack {1,n} \right\rbrack}} = {{{\left( {\sum\limits_{p_{1}}B_{\epsilon}} \right)\left\lbrack {1,1} \right\rbrack} \times {\left( {\sum\limits_{p_{1}}\left( {PM} \right)_{\epsilon}} \right)\left\lbrack {1,n} \right\rbrack}} - {{\left( {\sum\limits_{p_{2}}B_{\epsilon}} \right)\left\lbrack {1,1} \right\rbrack} \times {\left( {\sum\limits_{p_{2}}\left( {PM} \right)_{\epsilon}} \right)\left\lbrack {1,n} \right\rbrack}}}$ and wherein G₁ and G₂ are group vectors for a first group of products p₁ and a second group of products p₂, respectively, the first group of products p₁ and the second group of products p₂ being selected from the p products based on their recommendation scores in S[p, 1].
 16. A computer system comprising: one or more memories comprising a search index database; one or more processors operatively connected to the one or more memories; and one or more computer readable media operatively connected to the one or more processors and having stored thereon computer instructions for carrying out the steps of: (a) receiving, at the computer system, a search query containing a search phrase from a user device; (b) executing, by the computer system, a query on a search index database to identify matches in response to the search query; (c) computing, by the computer system, facet counts associated with the matches; (d) ranking, by the computer system, the matches; (e) selecting, by the computer system, top K of the matches, wherein K is a predetermined number; (f) receiving, at the computer system from a plurality of recommendation systems, a plurality of recommendation signals generated by the plurality of recommendation systems based at least in part on the search phrase, a customer record, the top K of the matches, and the facet counts; (g) calculating, by the computer system, recommendation scores based at least in part on a combination of the plurality of recommendation signals from the plurality of recommendation systems in accordance with a scoring function S[p, 1]: S[p,1]=B _(∈)[p,1]×((P|M)_(∈)[p,n]·U[n,r _(a)]·

[

,1]), wherein B[p,1]={circumflex over (Q)}[p,

]·

[

,1] is a multiplicative boost vector, U[n,r_(a)]=((R

)|M)_(∈)[n,

]×L[n,1] is an adjusted attribute value recommendation matrix, L is whitelisting and blacklisting of attributes and attribute values, P is a product matrix P[p, n] of p products, each product having a total of n possible one-hot encoded attribute values, M is a matrix M[a, n] of one-hot encoded attribute-value to attribute assignments, Q is a first recommendation signal Q[q, r_(p)] from r_(p) product recommenders, R is a second recommendation signal R[n, r_(a)] from r_(a) attribute- and attribute-value recommenders, W_(p) is a vector W_(p)[r_(p), 1] of product recommender weights, W_(a) is a vector W_(a)[r_(a), 1] of attribute- and attribute-value recommender weights, and the recommendation signals comprise the first recommendation signal Q[q, r_(p)] and the second recommendation signal R[n,r_(a)]; (h) determining, by the computer system, an order of the top K of the matches based at least in part on the recommendation scores; (i) determining, by the computer system, an order of facets and facet values based at least in part on the recommendation scores; (j) generating, by the computer system, a search result comprising the ordered top K of the matches and the ordered facets and facet values; and (k) providing, by the computer system, the search result to the user device.
 17. The computer system of claim 16, wherein the computer instructions further carry out, after the step (g) and before the step (h), the step of selecting at most k from the top K of the matches that contribute at least s % of a total of the recommendation scores, wherein the s % is a predetermined percentage of the total of the recommendation scores and k≤K.
 18. The computer system of claim 16, wherein the computer instructions further carry out, after the step (g) and before the step (i), the step of selecting facet values that contribute at least a predetermined percentage of the total of the recommendation scores per facet.
 19. The computer system of claim 16, wherein the step (g) of calculating recommendation scores comprises approximating, by the computer system using a machine learning model, a quadratic function S[p, 1]=(ϕ[p, r]·W_(p)[r, 1])×(ϕ[p, r]·W_(a)[r, 1]) to learn the product recommender weights W_(p) and the attribute- and attribute-value recommender weights W_(a) based at least in part on key-performance-indicator driving feedback signals, wherein: ϕ_(p)[p,r _(p)]={circumflex over (Q)} _(∈)[p,

]; ϕ_(a1)[p,r _(a)]=(P|M)_(∈)[p,n]·U[n,r _(a)]; ϕ[p,r]=ϕ_(p)[p,r _(p)]∪ϕ_(a1)[p,r _(a)], where r=r _(p) +r _(a); and a log associated with a matrix of product features ϕ is used as an input into training of the machine learning model.
 20. The computer system of claim 16, wherein the step (g) of calculating recommendation scores comprises directly approximating, by the computer system using a first machine learning model, product feature weights W[{tilde over (r)}, 1] wherein: S(ϕ)[p,1]=Φ[p,{tilde over (r)}]·W[{tilde over (r)},1]; W[{tilde over (r)},1]=(Φ^(T)[{tilde over (r)},p _(t)]·Φ[p _(t) ,{tilde over (r)}]+αI[{tilde over (r)},{tilde over (r)}])⁻¹·Φ^(T)[{tilde over (r)},p _(t)]·S[p _(t),1]; {tilde over (r)} is a number of transformed features from r input features; and transformed features Φ[p, {tilde over (r)}] of a log associated with ϕ[p,r] are used as input into training of the first machine learning model.
 21. The computer system of claim 20, wherein the transformed features Φ[p, {tilde over (r)}] are obtained by: $\overset{p}{\bigcup\limits_{k = 1}}{{\varphi (k)} \otimes {{\varphi (k)}.}}$
 22. The computer system of claim 20, wherein the first machine learning model comprises a pointwise training discipline using a linear model.
 23. The computer system of claim 16, wherein the step (g) of calculating recommendation scores comprises approximating, by the computer system using a first machine learning model, a linear function

_(a2) to learn attribute-and-attribute value weights W_(a) wherein:

_(a2)(P,M,R,W _(a))[n,1]=U[n,r _(a)]·

[

,1] ϕ_(a2)[n,r _(a)]=U[n,r _(a)]; and a log associated with ϕ_(a2) is used as an input into training of the first machine learning model.
 24. The computer system of claim 23, wherein the first machine learning model comprises a pointwise training discipline using a linear model.
 25. The computer system of claim 23, wherein the step (g) of calculating recommendation scores further comprises approximating, by the computer system using a second machine learning model, a feature transformation function $\Phi[p,r_{p}]=\phi_{p}[p,r_{p}]\times((P|M)_{\epsilon}[p,n]\cdot{\ }_{a2}(\phi_{a2}[p,r_{a}])[n,1])$ to learn the product recommender weights $W_{p}$, wherein: $S[p,1]=\Phi[p,r_{p}]\cdot W_{p}[r_{p},1]$; and $\phi_{p}[p,r_{p}]=\hat{Q}_{\epsilon}[p,\ ]$.
 26. The computer system of claim 25, wherein the second machine learning model is a pointwise training discipline using a linear model.
 27. The computer system of claim 16, wherein the step (g) of calculating recommendation scores comprises computing, by the computer system, a vector of attribute value scores $V[n,1]$ in accordance with: $V[n,1]=U[n,r_{a}]\cdot\ [\ ,1]$.
 28. The computer system of claim 27, wherein the step (g) of calculating recommendation scores further comprises computing, by the computer system, attribute scores $\ (V)$ by summing each block in accordance with: $\ (V)=A[a,1]=M[a,n]\cdot V[n,1]$.
 29. The computer system of claim 27, wherein the step (i) of determining an order of facets and facet values comprises generating, by the computer system, a vector of cumulative attribute value ranks $I[n,1]$ in accordance with: $I[n,1]=\operatorname{hi\_sort}\left(V[n,1]\times\sum_{p}\left((P|M)_{\epsilon}[p,n]\times B_{\epsilon}[p,1]\right)^{T}[n,p],\ M[a,n],\ \mathrm{DESC}\right)$.
 30. The computer system of claim 27, wherein the step (i) of determining an order of facets and facet values comprises generating, by the computer system, a vector of discriminative attribute value ranks $I[n,1]$ in accordance with: $I[n,1]=\operatorname{hi\_sort}\left(V[n,1]\times(1+\varepsilon-{\ }^{T})[n,1],\ M[a,n],\ \mathrm{DESC}\right)$, wherein $\varepsilon[n,1]$ is a vector having a very small value $\varepsilon$ for each element to facilitate a tie-break, $(G_{1}[1,n]-G_{2}[1,n])\cdot V[n,1]=\Delta G[1,n]\cdot V[n,1]=\sum_{i=1}^{n}(G_{1}(i)-G_{2}(i))*V(i)$, $G_{1}[1,n]=\left(\sum_{p_{1}}\mathcal{F}_{p_{1}}[p_{1},1]\right)[1,1]\times\left(\sum_{p_{1}}(P|M)_{\epsilon_{1}}[p_{1},n]\right)[1,n]$, $G_{2}[1,n]=\left(\sum_{p_{2}}\mathcal{F}_{p_{2}}[p_{2},1]\right)[1,1]\times\left(\sum_{p_{2}}(P|M)_{\epsilon_{2}}[p_{2},n]\right)[1,n]$, $\Delta G[1,n]=\left(\sum_{p_{1}}B_{e}\right)[1,1]\times\left(\sum_{p_{1}}(P|M)_{\epsilon}\right)[1,n]-\left(\sum_{p_{2}}B_{e}\right)[1,1]\times\left(\sum_{p_{2}}(P|M)_{\epsilon}\right)[1,n]$, and wherein $G_{1}$ and $G_{2}$ are group vectors for a first group of products $p_{1}$ and a second group of products $p_{2}$, respectively, the first group of products $p_{1}$ and the second group of products $p_{2}$ being selected from the $p$ products based on their recommendation scores in $S[p,1]$.
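By way of non-limiting illustration only, the closed-form weight computation recited in claims 20 and 21 may be sketched as follows. The sketch assumes that the transformed features are the flattened outer product of each product's feature row with itself, so that the number of transformed features equals r squared, and that the regularized system is solved by Gaussian elimination rather than by forming an explicit inverse. The function names, the sample feature log, the target scores and the value of alpha are hypothetical and are not taken from the specification.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def quadratic_features(phi_row: Vector) -> Vector:
    """Flattened outer product phi(k) (x) phi(k) for one product row (assumed reading of claim 21)."""
    return [a * b for a in phi_row for b in phi_row]


def solve_linear_system(a: Matrix, b: Vector) -> Vector:
    """Solve a . x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is left untouched.
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def ridge_weights(big_phi: Matrix, s: Vector, alpha: float) -> Vector:
    """W = (Phi^T . Phi + alpha . I)^-1 . Phi^T . S, obtained by solving the normal equations."""
    n_feat = len(big_phi[0])
    gram = [
        [
            sum(row[i] * row[j] for row in big_phi) + (alpha if i == j else 0.0)
            for j in range(n_feat)
        ]
        for i in range(n_feat)
    ]
    rhs = [sum(row[i] * target for row, target in zip(big_phi, s)) for i in range(n_feat)]
    return solve_linear_system(gram, rhs)


def score(big_phi: Matrix, w: Vector) -> Vector:
    """S[p, 1] = Phi[p, r~] . W[r~, 1]."""
    return [sum(f * wi for f, wi in zip(row, w)) for row in big_phi]


# Hypothetical training log: five products, r = 2 input features each.
phi_log = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
# Hypothetical target scores generated by (x + 2y) * (3x + y), a quadratic in the features.
targets = [3.0, 2.0, 12.0, 28.0, 42.0]

big_phi = [quadratic_features(row) for row in phi_log]
weights = ridge_weights(big_phi, targets, alpha=1e-6)
print(weights)
print(score(big_phi, weights))
```

In this sketch the outer product yields two identical columns (the two cross terms), so the unregularized matrix would be singular; the alpha term is what makes the system solvable, which is one reason a regularized closed form may be preferred for such transformed features.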